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Resume : On presente dans ce rapport une factorisation LU qui est plus stable que l'elimination 
de Gauss avec pivotage partiel (GEPP) en terme de facteur de croissace. Cette factorisation permet 
de resoudre des matrices pathologiques ou GEPP echoue (Wilkinson, Foster). On presente aussi 
la version minimisant la communication de ce nouvel algorithmc. 

Mots-cles : Factorisation LU, stabilite numerique, minimisation de la communication, factori- 
sation QR 
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Abstract: Wc present the LU decomposition with panel rank revealing pivot- 
ing (LtLPRRP), an LU factorization algorithm based on strong rank revealing 
QR panel factorization. LU_PRRP is more stable than Gaussian elimination 
with partial pivoting (GEPP), with a theoretical upper bound of the growth 
factor of (1 + t&)¥ , where b is the size of the panel used during the block factor- 
ization, t is a parameter of the strong rank revealing QR factorization, and n 
is the number of columns of the matrix. For example, if the size of the panel is 
b = 64, and r = 2, then (1 + 2b) n ' b = (1.079)" < 2"" 1 , where 2™- 1 is the upper 
bound of the growth factor of GEPP. Our extensive numerical experiments show 
that the new factorization scheme is as numerically stable as GEPP in practice, 
but it is more resistant to pathological cases and easily solves the Wilkinson 
matrix and the Foster matrix. The LILPRRP factorization does only 0(n 2 b) 
additional floating point operations compared to GEPP. 

We also present C ALU_PRRP, a communication avoiding version of LU_PRRP 
that minimizes communication. CALU-PRRP is based on tournament pivoting, 
with the selection of the pivots at each step of the tournament being performed 
via strong rank revealing QR factorization. CALLLPRRP is more stable than 
CALU, the communication avoiding version of GEPP, with a theoretical upper 
bound of the growth factor of (1 +rb)%( H+1 ^~ 1 , where b is the size of the panel 
used during the factorization, r is a parameter of the strong rank revealing QR 
factorization, n is the number of columns of the matrix, and H is the height 
of the reduction tree used during tournament pivoting. The upper bound of 
the growth factor of CALU is 2"( ff + 1 )- 1 . CALU_PRRP is also more stable in 
practice and is resistant to pathological cases on which GEPP and CALU fail. 

Key-words: LU factorization, numerical stability, communication avoiding, 
strong rank revealing QR factorization 



LU.PRRP and CALU.PRRP 



3 



1 Introduction 

The LU factorization is an important operation in numerical linear algebra 
since it is widely used for solving linear systems of equations, computing the 
determinant of a matrix, or as a building block of other operations. It consists of 
the decomposition of a matrix A into the product A = WLU , where L is a lower 
triangular matrix, U is an upper triangular matrix, and II a permutation matrix. 
The performance of the LU decomposition is critical for many applications, and 
it has received a significant attention over the years. Recently large efforts have 
been invested in optimizing this linear algebra kernel, in terms of both numerical 
stability and performance on emerging parallel architectures. 

The LU decomposition can be computed using Gaussian elimination with 
partial pivoting, a very stable operation in practice, except for several patho- 
logical cases, such as the Wilkinson matrix [211 [14], the Foster matrix [7], or 
the Wright matrix [S3] . Many papers [20] [TT] Q15] discuss the stability of the 
Gaussian elimination, and it is known [TH [21 [5] that the pivoting strategy used, 
such as complete pivoting, partial pivoting, or rook pivoting, has an important 
impact on the numerical stability of this method, which depends on a quantity 
referred to as the growth factor. However, in terms of performance, these piv- 
oting strategies represent a limitation, since they require asympotically more 
communication than established lower bounds on communication indicate is 
necessary [HH]. 

Technological trends show that computing floating point operations is be- 
coming exponentially faster than moving data from the memory where they are 
stored to the place where the computation occurs. Due to this, the communica- 
tion becomes in many cases a dominant factor of the runtime of an algorithm, 
that leads to a loss of its efficiency. This is a problem for both a sequential 
algorithm, where data needs to be moved between different levels of the mem- 
ory hierarchy, and a parallel algorithm, where data needs to be communicated 
between processors. 

This challenging problem has prompted research on algorithms that reduce 
the communication to a minimum, while being numerically as stable as classic 
algorithms, and without increasing significantly the number of floating point 
operations performed [H [TT] . We refer to these algorithms as communication 
avoiding. One of the first such algorithms is the communication avoiding LU 
factorization (CALU) [HI [10]. This algorithm is optimal in terms of communi- 
cation, that is it performs only polylogarithmic factors more than the theoretical 
lower bounds on communication require [H [T] . Thus, it brings considerable im- 
provements to the performance of the LU factorization compared to the classic 
routines that perform the LU decomposition such as the PDGETRF routine 
of ScaLAPACK, thanks to a novel pivoting strategy referred to as tournament 
pivoting. It was shown that CALU is faster in practice than the corresponding 
routine PDGETRF implemented in libraries as ScaLAPACK or vendor libraries, 
on both distributed |11| and shared memory computers [5j. While in practice 
CALU is as stable as GEPP, in theory the upper bound of its growth factor is 
worse than that obtained with GEPP. One of our goals is to design an algo- 
rithm that minimizes communication and that has a smaller upper bound of its 
growth factor than CALU. 

In the first part of this paper we present the LU_PRRP factorization, a 
novel LU decomposition algorithm based on that we call panel rank revealing 
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pivoting (PRRP). The LILPRRP factorization is based on a block algorithm 
that computes the LU decomposition as follows. At each step of the block 
factorization, a block of columns (panel) is factored by computing the strong 
rank revealing QR (RRQR) factorization [T^] of its transpose. The permutation 
returned by the panel rank revealing factorization is applied on the rows of the 
input matrix, and the L factor of the panel is computed based on the R factor of 
the strong RRQR factorization. Then the trailing matrix is updated. In exact 
arithmetic, the LU_PRRP factorization computes a block LU decomposition 
based on a different pivoting strategy, the panel rank revealing pivoting. The 
factors obtained from this decomposition can be stored in place, and so the 
LU_PRRP factorization has the same memory requirements as standard LU 
and can easily replace it in any application. 

We show that LU-PRRP is more stable than GEPP. Its growth factor is 
upper bounded by (1 + t&)t, where b is the size of the panel, n is the num- 
ber of columns of the input matrix, and r is a parameter of the panel strong 
RRQR factorization. This bound is smaller than 2 n_1 , the upper bound of the 
growth factor for GEPP. For example, if the size of the panel is b — 64, then 
(1 + 2b) n / h = (1.079)™ < 2 n_1 . In terms of cost, it performs only 0{n 2 b) more 
floating point operations than GEPP. In addition, our extensive numerical ex- 
periments on random matrices and on a set of special matrices show that the 
LU_PRRP factorization is very stable in practice and leads to modest growth 
factors, smaller than those obtained with GEPP. It also solves easily pathologi- 
cal cases, as the Wilkinson matrix and the Foster matrix, on which GEPP fails. 
While the Wilkinson matrix is a matrix constructed such that GEPP has an 
exponential growth factor, the Foster matrix [5] arises from a real application. 

We also discuss the backward stability of LU_PRRP using three metrics, 
the relative error \\PA — LJ7||/||yl||, the normwise backward error Q, and the 
componentwise backward error ([8]). For the matrices in our set, the relative error 
is at most 5.26 x 10 -14 , the normwise backward error is at most 1.09 x 10 -14 , and 
the componentwise backward error is at most 3.3 x 10~ 14 (with the exception 
of three matrices, sprandn, compan, and Demmel, for which the componentwise 
backward error is 1.3 x 10~ 13 , 6.9 x 10" 12 , and 1.16 x 10 -8 respectively). Later 
in this paper, figure [2] displays the ratios of these errors versus the errors of 
GEPP, obtained by dividing the maximum of the backward errors of LU-PRRP 
and the machine epsilon (2~ 53 ) by the maximum of those of GEPP and the 
machine epsilon. For all the matrices in our set, the growth factor of LU-PRRP 
is always smaller than that of GEPP (with the exception of one matrix, the 
compar matrix) . For random matrices, the relative error of the factorization of 
LU_PRRP is always smaller than that of GEPP. However, for the normwise and 
the componentwise backward errors, GEPP is slightly better, with a ratio of at 
most 2 between the two. For the set of special matrices, the ratio of the relative 
error is at most 1 in over 75% of cases, that is LU-PRRP is more stable than 
GEPP. For the rest of the 25% of the cases, the ratio is at most 3, except for 
one matrix (hadamard) for which the ratio is 23 and the backward error is on 
the order of 10~ 15 . The ratio of the normwise backward errors is at most 1 in 
over 75% of cases, and always 3.4 or smaller. The ratio of the componentwise 
backward errors is at most 2 in over 81% of cases, and always 3 or smaller (except 
for one matrix, the compan matrix, for which the componentwise backward error 
is 6.9 x 10" 12 for LILPRRP and 6.2 x 10" 13 for GEPP). 

In the second part of the paper we introduce the CALU-PRRP factorization, 
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the communication avoiding version of LU-PRRP. It is based on tournament 
pivoting, a strategy introduced in [lOj in the context of CALU, a communication 
avoiding version of GEPP. With tournament pivoting, the panel factorization 
is performed in two steps. The first step selects b pivot rows from the entire 
panel at a minimum communication cost. For this, sets of b candidate rows are 
selected from blocks of the panel, which are then combined together through a 
reduction-like procedure, until a set of b pivot rows are chosen. CALILPRRP 
uses the strong RRQR factorization to select b rows at each step of the reduction 
operation, while CALU is based on GEPP. In the second step of the panel 
factorization, the pivot rows are permuted to the diagonal positions, and the QR 
factorization with no pivoting of the transpose of the panel is computed. Then 
the algorithm proceeds as the LU_PRRP factorization. Note that the usage 
of the strong RRQR factorization ensures that bounds are respected locally at 
each step of the reduction operation, but it does not ensure that the growth 
factor is bounded globally as in LU_PRRP. 

To address the numerical stability of the communication avoiding factor- 
ization, we show that performing the CALU_PRRP factorization of a matrix 
A is equivalent to performing the LU-PRRP factorization of a larger matrix, 
formed by blocks of A and zeros. This equivalence suggests that CALU_PRRP 
will behave as LU_PRRP in practice and it will be stable. The dimension and 
the sparsity structure of the larger matrix also allows us to upper bound the 
growth factor of CALU-PRRP by (1 + T&)? (ir+1 ) _1 , where in addition to the 
parameters n, b, and r previously defined, H is the height of the reduction tree 
used during tournament pivoting. 

This algorithm has two significant advantages over other classic factoriza- 
tion algorithms. First, it minimizes communication, and hence it will be more 
efficient than LU_PRRP and GEPP on architectures where communication is 
expensive. Here communication refers to both latency and bandwidth costs of 
moving data between levels of the memory hierarchy in the sequential case, and 
the cost of moving data between processors in the parallel case. Second, it is 
more stable than CALU. Theoretically, the upper bound of the growth factor of 
CALU_PRRP is smaller than that of CALU, for a reduction tree with a same 
height. More importantly, there are cases of interest for which it is smaller 
than that of GEPP as well. Given a reduction tree of height H = logP, where 
P is the number of processors on which the algorithm is executed, the panel 
size b and the parameter r can be chosen such that the upper bound of the 
growth factor is smaller than 2 n_1 . Extensive experimental results show that 
CALUJPRRP is as stable as LU_PRRP, GEPP, and CALU on random matrices 
and a set of special matrices. Its growth factor is slightly smaller than that of 
CALU. In addition, it is also stable for matrices on which GEPP fails. 

As for the LU_PRRP factorization, we discuss the stability of CALU_PRRP 
using three metrics. For the matrices in our set, the relative error is at most 
9.14 x 10~ 14 , the normwise backward error is at most 1.37 x 10~ 14 , and the 
componentwise backward error is at most 1.14x 1CP 8 for Demmel matrix. Figure 
[5] displays the ratios of the errors with respect to those of GEPP, obtained by 
dividing the maximum of the backward errors of CALU_PRRP and the machine 
epsilon by the maximum of those of GEPP and the machine epsilon. For random 
matrices, all the backward error ratios are at most 2.4. For the set of special 
matrices, the ratios of the relative error are at most 1 in over 62% of cases, and 
always smaller than 2, except for 8% of cases, where the ratios are between 2.4 
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and 24.2. The ratios of the normwise backward errors are at most 1 in over 
75% of cases, and always 3.9 or smaller. The ratios of componentwise backward 
errors are at most 1 in over 47% of cases, and always 3 or smaller, except for 7 
ratios which have values up to 74. 

We also discuss a different version of LU.PRRP that minimizes commu- 
nication, but can be less stable than CALILPRRP, our method of choice for 
reducing communication. In this different version, the panel factorization is 
performed only once, during which its off-diagonal blocks are annihilated using 
a reduce-like operation, with the strong RRQR factorization being the operator 
used at each step of the reduction. Every such factorization of a block of rows of 
the panel leads to the update of a block of rows of the trailing matrix. Indepen- 
dently of the shape of the reduction tree, the upper bound of the growth factor 
of this method is the same as that of LU-PRRP. This is because at every step 
of the algorithm, a row of the current trailing matrix is updated only once. We 
refer to the version based on a binary reduction tree as block parallel LLLPRRP, 
and to the version based on a flat tree as block pairwise LLLPRRP. There are 
similarities between these two algorithms, the LU factorization based on block 
parallel pivoting (an unstable factorization), and the LU factorization based on 
block pairwise pivoting (whose stability is still under investigation) [TSJ [2UJ 12] • 
All these methods perform the panel factorization as a reduction operation, and 
the factorization performed at every step of the reduction leads to an update 
of the trailing matrix. However, in block parallel pivoting and block pairwise 
pivoting, GEPP is used at every step of the reduction, and hence U factors are 
combined together during the reduction phase. While in the block parallel and 
block pairwise LU_PRRP, the reduction operates always on original rows of the 
current panel. 

Despite having better bounds, the block parallel LU_PRRP based on a binary 
reduction tree of height H = log P is unstable for certain values of the panel 
size b and the number of processors P. The block pairwise LU-PRRP based on 
a flat tree of height H = ? appears to be more stable. The growth factor is 
larger than that of CALU_PRRP, but it is smaller than n for the sizes of the 
matrices in our test set. Hence, potentially this version can be more stable than 
block pairwise pivoting, but requires further investigation. 

The remainder of the paper is organized as follows. Section [2] presents the 
algebra of the LU_PRRP factorization, discusses its stability, and compares it 
with that of GEPP. It also presents experimental results showing that LU_PRRP 
is more stable than GEPP in terms of worst case growth factor, and it is more 
resistant to pathological matrices on which GEPP fails. Section [3] presents the 
algebra of CALU_PRRP, a communication avoiding version of LU_PRRP. It 
describes similarities between CALU-PRRP and LU-PRRP and it discusses its 
stability. The communication optimality of CALU_PRRP is shown in section |4j 
where we also compare its performance model with that of the CALU algorithm. 
Section [5] discusses two alternative algorithms that can also reduce communi- 
cation, but can be less stable in practice. Section [6] concludes and presents our 
future work. 
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2 LU PRRP Method 

In this section we introduce the LILPRRP factorization, an LU decomposition 
algorithm based on panel rank revealing pivoting strategy. It is based on a 
block algorithm, that factors at each step a block of columns (a panel), and 
then it updates the trailing matrix. The main difference between LLLPRRP 
and GEPP resides in the panel factorization. In GEPP the panel factorization 
is computed using LU with partial pivoting, while in LU_PRRP it is computed 
by performing a strong RRQR factorization of its transpose. This leads to a 
different selection of pivot rows, and the obtained R factor is used to compute 
the block L factor of the panel. In exact arithmetic, LU_PRRP performs a block 
LU decomposition with a different pivoting scheme, which aims at improving the 
numerical stability of the factorization by bounding more efficiently the growth 
of the elements. We also discuss the numerical stability of LU_PRRP, and we 
show that both in theory and in practice, LU_PRRP is more stable than GEPP. 



2.1 The algebra 

LU_PRRP is based on a block algorithm that factors the input matrix A of size 
mxnby traversing blocks of columns of size 6. Consider the first step of the 
factorization, with the matrix A having the following partition, 



A 



An A 12 
A21 A22 



(1) 



where An is of size 6x6, A21 is of size (m — b) x 6, A12 is of size 6 x (n — b), 
and A22 is of size (m — 6) x (n — 6). 

The main idea of the LILPRRP factorization is to eliminate the elements 
below the 6x6 diagonal block such that the multipliers used during the update 
of the trailing matrix are bounded by a given threshold r. For this, we perform 
a strong RRQR factorization on the transpose of the first panel of size m x 6 to 
identify a permutation matrix LI, that is 6 pivot rows, 



A 2 i 



1 T 



n 



An 
Mi 



= Q [ R(l : 6, 1 : 6) R(l : b, b + 1 : m) ] = Q [ Rn R u 



where A denotes the permuted matrix A. The strong RRQR factorization en- 
sures that the quantity R^i-R-ii) 7 ^ s bounded by a given threshold r in the 
max norm. The strong RRQR factorization, as described in Algorithm [2] in Ap- 
pendix A, computes first the QR factorization with column pivoting, followed 
by additional swaps of the columns of the R factor and updates of the QR 
factorization, so that ll^ii^ii^llmax < t. 

After the panel factorization, the transpose of the computed permutation II 
is applied on the input matrix A, and then the update of the trailing matrix is 
performed, 



A = W A 



h 

L 2 i In 



An 



A 12 

A 22 



(2) 



where 



A\2 = A22 - L21A 



12- 



(3) 
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Note that in exact arithmetic, we have L 2 \ — A^iA^ = i?^ 2 (i? 11 1 ) T . Hence 
the factorization in equation |2]) is equivalent to the factorization 



A = It 1 A = 



h 

i 2 iin 



T m — h 



Au 



Mi 



where 



A s 22 = A 2 



A 21 A 11 1 A 12 , 



(4) 



and A^A^ was computed in a numerically stable way such that it is bounded 
in max norm by r. 

Since the block size is in general 6 > 2, performing LU_PRRP on a given 
matrix A first leads to a block LU factorization, with the diagonal blocks An 
being square of size 6x6. An additional Gaussian elimination with partial 
pivoting is performed on the 6x6 diagonal block An as well as the update of 
the corresponding trailing matrix Ai 2 . Then the decomposition obtained after 
the elimination of the first panel of the input matrix A is 



A 



Il T A - 



lb 




' An Al2 " 




h 




' ill 




' Uu U13 ' 


L21 Im-b 




Ah . 




L21 Im-b 




I m — b 




Ma . 



where L\\ is a lower triangular 6x6 matrix with unit diagonal and U\\ is an 



upper triangular 6x6 matrix. We show in section [272] that this step does not 
affect the stability of the LU-PRRP factorization. Note that the factors L and 
U can be stored in place, and so LLLPRRP has the same memory requirements 
as the standard LU decomposition and can easily replace it in any application. 

Algorithm [T] presents the LU_PRRP factorization of a matrix A of size nxn 
partitioned into ^ panels. The number of floating-point operations performed 
by this algorithm is 



#flops= -n 3 + 0(n 2 b), 

which is only 0(n 2 b) more floating point operations than GEPP. The detailed 
counts are presented in Appendix C. When the QR factorization with column 
pivoting is sufficient to obtain the desired bound for each panel factorization, 
and no additional swaps are performed, the total cost is 



#flops = -n 



-n 2 b. 



2.2 Numerical stability 

In this section we discuss the numerical stability of the LU_PRRP factorization. 
The stability of an LU decomposition depends on the growth factor. In his 
backward error analysis |21j . Wilkinson proved that the computed solution x 
of the linear system Ax = 6, where A is of size nxn, obtained by Gaussian 
elimination with partial pivoting or complete pivoting satisfies 

(A + AA)x = 6, || AAHoo < pirijgwuWAWoo. 

In the formula, p(n) is a cubic polynomial, u is the machine precision, and gw 
is the growth factor defined by 
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Algorithm 1 LLLPRRP factorization of a matrix A of size n x n 
l: for j from 1 to ? do 

2: Let Aj be the current panel Aj = A((j — 1)6 + 1 : n, (j — 1)6 + 1 : jb). 
3: Compute panel factorization AjUj :— QjRj using strong RRQR factor- 
ization, 

4: L 2j := (-Rj(l : 6, 1 : &) -1 iZ,-(l : 6, 6 + 1 : n - (j - 1)6)) T . 

5: Pivot by applying the permutation matrix Hj on the entire matrix, 

A = HjA. 
6: Update the trailing matrix, 

7: A(jb + l : n,jb+l : n)- = L 2j A((j - 1)6 + 1 : jbjb + l : n). 

8: Let Ajj be the current 6x6 diagonal block, 
9: A n = A((j - 1)6 + 1 : j6, (j - 1)6 + 1 : jb). 

10: Compute = HjjLjjUjj using GEPP. 

11: Compute C/((j-l)6+l : j6, j6+l : n) = /., ,' 1 :6 • ! : jb,jb+l : 

n). 

12: end for 



i ( fc ) i 

9W = m axij\aij\ ' 

where a^l denotes the entry in position (i,j) obtained after k steps of elim- 
ination. Thus the growth factor measures the growth of the elements dur- 
ing the elimination. The LU factorization is backward stable if gw is of or- 
der 0(1) (in practice the method is stable if the growth factor is a slowly 
growing function of n). Lemma 9.6 of [13 (section 9.3) states a more gen- 
eral result, showing that the LU factorization without pivoting of A is back- 
ward stable if the growth factor is small. Wilkinson [21] showed that for 
partial pivoting, the growth factor g\y < 2™ _1 , and this bound is attain- 
able. He also showed that for complete pivoting, the upper bound satisfies 
gw < n 1 / 2 (2.3 1 / 2 n i/(«-i))i/2 _ cn i/2 n i/4io g n^ In practice trie growth fac- 
tors are much smaller than the upper bounds. 

In the following, we derive the upper bound of the growth factor for the 
LU_PRRP factorization. We use the same notation as in the previous section 
and we assume without loss of generality that the permutation matrix is the 
identity. It is easy to see that the growth factor obtained after the elimination 
of the first panel is bounded by (1 + rb). At the k-th step of the block factor- 
ization, the active matrix A^ is of size (m — (k — 1)6) x (n — (k — 1)6), and the 
decomposition performed at this step can be written as 

A (k) Ak) 

The active matrix at the (k+l)-th step is A 22 +1} = A 22 )s = A { 22 ] - L^A$. 
Then maxjj \a\ k ^\ < maxi j \a\ ^ |(1 + rb) with maxjj \L 2 \ (i,j)\ < r and we 
have 

9w (k+1) <9w {k \l + rb). (5) 



A (k) 

\{k) 



A 



A 2l A. 



(fc) 

12 

(k) 



L 



h 

(k) 



m-(fe+l)b 
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Induction on equation ([5| leads to a growth factor of LLLPRRP performed on 
the ^ panels of the matrix A that satisfies 

9w <(l + rb) n/b . (6) 

As explained in the algebra section, the LLLPRRP factorization leads first to a 
block LU factorization, which is completed with additional GEPP factorizations 
of the diagonal b x b blocks and updates of corresponding blocks of rows of the 
trailing matrix. These additional factorizations lead to a growth factor bounded 
by 2 b on the trailing blocks of rows. Since we choose b -C n, we conclude that 
the growth factor of the entire factorization is still bounded by (1 + r6)"/ b . 

The improvement of the upper bound of the growth factor of LLLPRRP with 
respect to GEPP is illustrated in Table [l] where the panel size varies from 8 to 
128, and the parameter r is equal to 2. The worst case growth factor becomes 
arbitrarily smaller than for GEPP, for b > 64. 

Table 1: Upper bounds of the growth factor gw obtained from factoring a matrix 
of size m x n using LLLPRRP with different panel sizes and r = 2. 



b 


9w 


8 


(1.425)"- 1 


16 


(1.244) n - 1 


32 


(1.139)"" 1 


64 


(1.078)"- 1 


128 


(1.044) 71 " 1 



Despite the complexity of our algorithm in pivot selection, we still compute 
an LU factorization, only with different pivots. Consequently, the rounding error 
analysis for LU factorization still applies (see, for example, [3]), which indicates 
that element growth is the only factor controlling the numerical stability of our 
algorithm. 

2.3 Experimental results 

We measure the stability of the LU_PRRP factorization experimentally on a 
large set of test matrices by using several metrics, as the growth factor, the 
normwise backward stability, and the componentwise backward stability. The 
tests are performed in Matlab. In the tests, in most of the cases, the panel fac- 
torization is performed by using the QR with column pivoting factorization in- 
stead of the strong RRQR factorization. This is because in practice R^^iiY^ 
is already well bounded after performing the RRQR factorization with column 
pivoting (||-Ri2(^ri 1 ) T ||max is rarely bigger than 3). Hence no additional swaps 
are needed to ensure that the elements are well bounded. However, for the ill- 
conditionned special matrices (condition number > 10 14 ), to get small growth 
factors, we perform the panel factorization by using the strong RRQR factor- 
ization. In fact, for these cases, QR with column pivoting does not ensure a 
small bound for i?f 2 (i?^" 1 1 ) T . 

We use a collection of matrices that includes random matrices, a set of 
special matrices described in Table [l5j and several pathological matrices on 
which Gaussian elimination with partial pivoting fails because of large growth 
factors. The set of special matrices includes ill-conditioned matrices as well as 
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sparse matrices. The pathological matrices considered are the Wilkinson matrix 
and two matrices arising from practical applications, presented by Foster [7] and 
Wright [33], for which the growth factor of GEPP grows exponentially. The 
Wilkinson matrix was constructed to attain the upper bound of the growth 
factor of GEPP [3TJ [TJ] , and a general layout of such a matrix is 



A = diag(±l) 



1 

-1 1 

-1 -1 



•• 1 
-1 1 1 
-1 -1 1 



T 





9 



where T is an (n — 1) x (n — 1) non-singular upper triangular matrix and 9 = 
max|ajj|. We also test a generalized Wilkinson matrix, the general form of such 
a matrix is 



A = 







1 

1 1 
1 



+ T 1 



where T is an n x n upper triangular matrix with zero entries on the main 
diagonal. The matlab code of the matrix A is detailed in Appendix F. 

The Foster matrix represents a concrete physical example that arises from 
using the quadrature method to solve a certain Volterra integral equation and 
it is of the form 



A = 



1 

kh 
2 

kh 
2 



kh 
2 

kh 
2 





i kh 

1 2 



-kh 



-kh 1 



-kh 



Ml 

2 



-kh 
-kh 



kh 
2 



-kh 



1 — - — 



Wright [23 discusses two-point boundary value problems for which standard 
solution techniques give rise to matrices with exponential growth factor when 
Gaussian elimination with partial pivoting is used. This kind of problems arise 
for example from the multiple shooting algorithm. A particular example of this 
problem is presented by the following matrix, 



I 

„Mh 



A = 



I 

„Mh 



I 





-Mh 
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where e Mh = I + Mh + 0{h 2 ). 

The experimental results show that the LU-PRRP factorization is very sta- 
ble. Figure [T] displays the growth factor of LILPRRP for random matrices of 
size varying from 1024 to 8192 and for sizes of the panel varying from 8 to 128. 
We observe that the smaller the size of the panel is, the bigger the element 
growth is. In fact, for a smaller size of the panel, the number of panels and the 
number of updates on the trailing matrix is bigger, and this leads to a larger 
growth factor. But for all panel sizes, the growth factor of LU_PRRP is smaller 
than the growth factor of GEPP. For example, for a random matrix of size 4096 
and a panel of size 64, the growth factor is only about 19, which is smaller than 
the growth factor obtained by GEPP, and as expected, much smaller than the 
theoretical upper bound of (1.078) 4095 . 



Figure 1: Growth factor gw of the LU_PRRP factorization of random matrices. 



70 - 




1024 2048 4096 8192 

matrix size 



Tables [16] and [18] in Appendix B present more detailed results showing the 
stability of the LILPRRP factorization for random matrices and a set of special 
matrices. There, we include different metrics, such as the norm of the factors, 
the value of their maximum element and the backward error of the LU factoriza- 
tion. We evaluate the normwise backward stability by computing three accuracy 
tests as performed in the HPL (High-Performance Linpack) benchmark [6], and 
denoted as HPL1, HPL2 and HPL3. 

HPL1 = ||AB-&||oo/(e||i4||i*JV), 
HPL2 = WAx-bW^/ieWAMxWx), 
HPL3 = WAx-bWoo/ieWAW^WxWoo*^, 

In HPL, the method is considered to be accurate if the values of the three 
quantities are smaller than 16. More generally, the values should be of order 
0(1). For the LLLPRRP factorization HPL1 is at most 8.09, HPL2 is at most 
8.04 x 10~ 2 and HPL3 is at most 1.60 x 10~ 2 . We also display the normwise 
backward error, using the 1-norm, 

" := Pll M+WV (7) 
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and the componentwise backward error 



max 



{\A\ 



(8) 



where the computed residual is r — 6 — Ax. For our tests residuals are computed 
with double-working precision. 

Figure [2] summarizes all our stability results for LU_PRRP. This figure dis- 
plays the ratio of the maximum between the backward error and machine epsilon 
of LILPRRP versus GEPP. The backward error is measured using three met- 
rics, the relative error \\PA — LU\\/\\A\\, the normwise backward error 77, and 
the componentwise backward error w of LU_PRRP versus GEPP, and the ma- 
chine epsilon. We take the maximum of the computed error with epsilon since 
smaller values are mostly roundoff error, and so taking ratios can lead to ex- 
treme values with little reliability. Results for all the matrices in our test set are 
presented, that is 20 random matrices for which results are presented in Table 
[TBI an d 37 special matrices for which results are presented in Tables [T7] and [18] 
This figure shows that for random matrices, almost all ratios are between 0.5 
and 2. For special matrices, there are few outliers, up to 23.71 (GEPP is more 
stable) for the backward error ratio of the special matrix hadamard and down 
to 2.12 x 10~ 2 (LILPRRP is more stable) for the backward error ratio of the 
special matrix moler. 



I I. I 




1 » ■■iLl.i, 



fipr 




special matrices 



special matrices 



special matrices 



||PA-LU||/||A|| 



normwise backward error 



componenUvise backward error 



Figure 2: A summary of all our experimental data, showing the ratio between 
max(LLLPRRP's backward error, machine epsilon) and max(GEPP's backward 
error, machine epsilon) for all the test matrices in our set. Each vertical bar 
represents such a ratio for one test matrix. Bars above 10° = 1 mean that 
LLLPRRP's backward error is larger, and bars below 1 mean that GEPP's 
backward error is larger. For each matrix and algorithm, the backward error is 
measured 3 ways. For the first third of the bars, labeled \\PA — L£/||/||A||, the 
metric is the backward error, using the Frobenius norm. For the middle third 
of the bars, labeled "normwise backward error" , the metric is r\ in equation 
Q. For the last third of the bars, labeled "componentwise backward error", 
the metric is w in equation Q. The test matrices are further labeled either as 
"randn", which are randomly generated, or "special", listed in Table [T5] 



We consider now pathological matrices on which GEPP fails. Table[2]presents 
results for the linear solver using the LILPRRP factorization for a Wilkinson 
matrix [22] of size 2048 with a size of the panel varying from 8 to 128. The 
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growth factor is 1 and the relative error - — rpj, — - is on the order of 10 19 . Ta- 
ble [3] presents results for the linear solver using the LtLPRRP algorithm for a 
generalized Wilkinson matrix of size 2048 with a size of the panel varying from 
8 to 128. 



Table 2: Stability of the LU-PRRP factorization of a Wilkinson matrix on which 



GEPP fails. 



n 


b 


9w 


\W\\i 




\\L\\i 


ll^lli 


\\PA-LU\\ F 
\\A\\p 


2048 


128 


1 


1.02c+03 


6.09e+00 


1 


1.95e+00 


4.25e-20 


64 


1 


1.02c+03 


6.09c+00 


1 


1.95c+00 


5.29e-20 


32 


1 


1.02c+03 


6.09c+00 


1 


1.95c+00 


8.63e-20 


16 


1 


1.02c+03 


6.09c+00 


1 


1.95c+00 


1.13e-19 


8 


1 


1.02c+03 


6.09c+00 


1 


1.95c+00 


1.57e-19 



Table 3: Stability of the LU-PRRP factorization of a generalized Wilkinson 
matrix on which GEPP fails. 



n 


b 


9w 


\\U\\i 


\\u-% 


\\L\\i 


ll^llt 


\\PA-LU\\ F 
\\A\\f 


2048 


128 


2.69 


1.23e+03 


1.39c+02 


1.21c+03 


1.17c+03 


1.05e-15 


64 


2.61 


9.09e+02 


1.12c+02 


1.36e+03 


1.15c+03 


9.43e-16 


32 


2.41 


8.20e+02 


1.28c+02 


1.39c+03 


9.77c+02 


5.53e-16 


16 


4.08 


1.27e+03 


2.79e+02 


1.41e+03 


1.19e+03 


7.92e-16 


8 


3.35 


1.36e+03 


2.19c+02 


1.41c+03 


1.73c+03 


1.02e-15 



For the Foster matrix, it was shown that when c = 1 and kh = |, the growth 
factor of GEPP is (|)(2™^ 1 - 1), which is close to the maximum theoretical 
growth factor of GEPP of 2" _1 . Table [4] presents results for the linear solver 
using the LU-PRRP factorization for a Foster matrix of size 2048 with a size 
of the panel varying from 8 to 128 ( c = 1, h = 1 and k = | ). According 
to the obtained results, LU-PRRP gives a modest growth factor of 2.66 for 
this practical matrix, while GEPP has a growth factor of 10 18 for the same 
parameters. 



Table 4: Stability of the LU-PRRP factorization of a practical matrix (Foster) 
on which GEPP fails. 



n 


b 


9w 


\W\\i 


llt^Hi 


\\L\U 


ll^llt 


\\PA-LU\\ F 
\\M\f 


2048 


128 


2.66 


1.28e+03 


1.87c+00 


1.92c+03 


1.92e+03 


4.67e-16 


64 


2.66 


1.19e+03 


1.87c+00 


1.98c+03 


1.79c+03 


2.64e-16 


32 


2.66 


4.33e+01 


1.87c+00 


2.01e+03 


3.30e+01 


2.83e-16 


16 


2.66 


1.35c+03 


1.87c+00 


2.03c+03 


2.03c+00 


2.38e-16 


8 


2.66 


1.35c+03 


1.87c+00 


2.04c+03 


2.02c+00 


5.36e-17 



For matrices arising from the two-point boundary value problems described 
by Wright, it was shown that when h is chosen small enough such that all 
elements of e Mh are less than 1 in magnitude, the growth factor obtained using 



GEPP is exponential. For our experiment the matrix M 



1 



that 
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is e 



Mh 



1 - 



1-g 



and ft, = 0.3. Table 



presents results for the linear 



solver using the LILPRRP factorization for a Wright matrix of size 2048 with a 
size of the panel varying from 8 to 128. According to the obtained results, again 
LU_PRRP gives minimum possible pivot growth 1 for this practical matrix, 
compared to the GEPP method which leads to a growth factor of 10 95 using 
the same parameters. 



Table 5: Stability of the LILPRRP factorization on a practical matrix (Wright) 
on which GEPP fails. 



n 


b 


gw 


\\U\\i 


ll^lli 


\\L\\i 


ll^llt 


\\PA-LU\\ F 
\\A\\ F 




128 


l 


3.25e+00 


8.00e+00 


2.00e+00 


2.00e+00 


4.08e-17 




64 


l 


3.25c+00 


8.00c+00 


2.00c+00 


2.00c+00 


4.08e-17 


2048 


32 


l 


3.25c+00 


8.00c+00 


2.05c+00 


2.07c+00 


6.65e-17 




16 


l 


3.25e+00 


8.00e+00 


2.32e+00 


2.44e+00 


1.04e-16 




8 


l 


3.40c+00 


8.00c+00 


2.62c+00 


3.65c+00 


1.26e-16 



All the previous tests show that the LILPRRP factorization is very stable for 
random, and for more special matrices, and it also gives modest growth factor 
for the pathological matrices on which GEPP fails. We note that we were not 
able to find matrices for which LLLPRRP attains the upper bound of (1 + rb) T 
for the growth factor. 



3 Communication avoiding LU_PRRP 

In this section we present a communication avoiding version of the LU-PRRP 
algorithm, that is an algorithm that minimizes communication, and so it will be 
more efficient than LU_PRRP and GEPP on architectures where communication 
is expensive. We show in this section that this algorithm is more stable than 
CALU, an existing communication avoiding algorithm for computing the LU 
factorization [10]. More importantly, its parallel version is also more stable 
than GEPP (under certain conditions). 

3.1 Matrix algebra 

CALU_PRRP is a block algorithm that uses tournament pivoting, a strategy 
introduced in [10] that allows to minimize communication. As in LU_PRRP, 
at each step the factorization of the current panel is computed, and then the 
trailing matrix is updated. However, in CALU-PRRP the panel factorization 
is performed in two steps. The first step, which is a preprocessing step, uses a 
reduction operation to identify b pivot rows with a minimum amount of com- 
munication. The strong RRQR factorization is the operator used at each node 
of the reduction tree to select a new set of b candidate rows from the candidate 
rows selected at previous stages of the reduction. The b pivot rows are permuted 
into the diagonal positions, and then the QR factorization with no pivoting of 
the transpose of the entire panel is computed. 

In the following we illustrate tournament pivoting on the first panel, with the 
input matrix A partitioned as in equation (TH). Tournament pivoting considers 
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that the first panel is partitioned into P = 4 blocks of rows, 



A(:,l:b) = 



A w 
A 20 
A 30 



The preprocessing step uses a binary reduction tree in our example, and we 
number the levels of the reduction tree starting from 0. At the leaves of the 
reduction tree, a set of 6 candidate rows are selected from each block of rows A i0 
by performing the strong RRQR factorization on the transpose of each block 
A i0 . This gives the following decomposition, 



^oolloo 




Qoo-Roo 


Aj U 10 




Q10R10 


A 20II20 




Q20R20 


^301130 




Q30-R30 



which can be written as 



A(:,l: b) T Yl = A{:, 1 : bf 



' IIoo 




Qoo-Roo 


nio 




Q10R10 


n 20 




Q20R20 


n 30 




Q30-R30 



where n is an m x m permutation matrix with diagonal blocks of size 7J p x ^ , 
Qio is an orthogonal matrix of size 6x6, and each factor R i0 is an b x ^ upper 
triangular matrix. 

There are now P = 4 sets of candidate rows. At the first level of the binary 
tree, two matrices ^4 i an d An are formed by combining together two sets of 
candidate rows, 





,1 


b) ' 




,1 


b) . 


" (4r n 2 o)( 


,1 


b) ' 


. (A&IIaoX 


,1 


b) . 



Two new sets of candidate rows are identified by performing the strong RRQR 
factorization of each matrix A i and An, 

^01 Hoi = Qoi-Roi; 
^nlli! = Q11-R11, 



where IIio, ITn are permutation matrices of size 26 x 26, Q01, Q11 are orthogonal 
matrices of size 6x6, and i? i, ^11 are upper triangular factors of size 6 x 26. 

The final 6 pivot rows are obtained by performing one last strong RRQR 
factorization on the transpose of the following 6 x 26 matrix : 



A02 = 



(45in i)(:,l:&) 
(^inn)(:,l:6) 
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that is 



^02^02 — Q02R02, 



where n 02 is a permutation matrix of size 26 x 26, Qq 2 is an orthogonal matrix 
of size b x b, and Rq 2 is an upper triangular matrix of size b x 2b. This operation 
is performed at the root of the binary reduction tree, and this ends the first step 
of the panel factorization. In the second step, the final pivot rows identified by 
tournament pivoting are permuted to the diagonal positions of A, 

A = u t a = n^nf til a, 

where the matrices IT are obtained by extending the matrices n to the dimension 
mx?n, that is 



IIi = 

with Hi, for i = 0, 1 formed as 
n a (l: 6,1:6) 
n a (6+l : 26,1 : 6) 



n n 



flu 



Im-t, 



and 



n 2 = 



n 02 (l : 6,1 : 6) 
II 02 (6+ 1 : 26,1 : 6) 



nn 



ITi(l : 6,6 + 1 : 26) 
n ix (6+l : 26,6 + 1 : 26) 

n 02 (l : 6,6+1 : 26) 
n 02 (6+ 1 : 26,6 + 1 : 26) 



Once the pivot rows are in the diagonal positions, the QR factorization with 
no pivoting is performed on the transpose of the first panel, 

A T (l:b,:) = QR= [ R xl R 12 ]. 

This factorization is used to update the trailing matrix, and the elimination of 
the first panel leads to the following decomposition (we use the same notation 
as in section [2I, 



A 



h 



m — b 



An 



A 12 
A s 



where 



i 22 = A 22 - A 21 A^ A 12 . 



As in the LILPRRP factorization, the CALU_PRRP factorization computes a 
block LU factorization of the input matrix A. To obtain the full LU factoriza- 
tion, an additional GEPP is performed on the diagonal block An, followed by 
the update of the block row ^4i 2 . The CALU_PRRP factorization continues the 
same procedure on the trailing matrix A 22 . 
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Note that the factors L and U obtained by the CALLLPRRP factoriza- 
tion are different from the factors obtained by the LU-PRRP factorization. 
The two algorithms use different pivot rows, and in particular the factor L of 
CALILPRRP is no longer bounded by a given threshold r as in LU_PRRP. 
This leads to a different worst case growth factor for CALILPRRP, that we will 
discuss in the following section. 

The following figure displays the binary tree based tournament pivoting per- 
formed on the first panel using an arrow notation (as in 10J). The function 
f(Aij) computes a strong RRQR of the matrix Ajj to select a set of b candi- 
date rows. At each node of the reduction tree, two sets of b candidate rows are 
merged together and form a matrix Ay, the function / is applied on A^j, and 
another set of b candidate rows is selected. While in this section we focused on 
binary trees, tournament pivoting can use any reduction tree, and this allows 
the algorithm to adapt on different architectures. Later in the paper we will 
consider also a flat reduction tree. 



Aq -> f(A 00 ) 
A w -> f(A 10 ) 



A 20 ^ f(A 20 )\ /> 
A 30 ^f(A 3Q )^ f{All) 



f(A, 



D2 I 



3.2 Numerical Stability of CALU_PRRP 

In this section we discuss the stability of the CALLLPRRP factorization and 
we identify similarities with the LU_PRRP factorization. We also discuss the 
growth factor of the CALU-PRRP factorization, and we show that its upper 
bound depends on the height of the reduction tree. For the same reduction tree, 
this upper bound is smaller than that obtained with CALU. More importantly, 
for cases of interest, the upper bound of the growth factor of CALLLPRRP is 
also smaller than that obtained with GEPP. 

To address the numerical stability of CALU-PRRP, we show that perform- 
ing CALU_PRRP on a matrix A is equivalent to performing LU-PRRP on a 
larger matrix Anjpppp, which is formed by blocks of A (sometimes slightly 
perturbed) and blocks of zeros. This reasoning is also used in [TU] to show the 
same equivalence between CALU and GEPP. While this similarity is explained 
in detail in [10] , here we focus only on the first step of the CALU_PRRP factor- 
ization. We explain the construction of the larger matrix Ajjj ^purp to expose 
the equivalence between the first step of the CALU_PRRP factorization of A 
and the LU-PRRP factorization of Alu_prrp- 

Consider a nonsingular matrix A of size m x n and the first step of its 
CALUJPRRP factorization using a general reduction tree of height H. Tour- 
nament pivoting selects b candidate rows at each node of the reduction tree by 
using the strong RRQR factorization. Each such factorization leads to an L 
factor which is bounded locally by a given threshold r. However this bound is 
not guaranteed globally. When the factorization of the first panel is computed 
using the b pivot rows selected by tournament pivoting, the L factor will not 
be bounded by r. This results in a larger growth factor than the one obtained 
with the LU-PRRP factorization. Recall that in LU-PRRP, the strong RRQR 
factorization is performed on the transpose of the whole panel, and so every 
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entry of the obtained lower triangular factor L is bounded by r. 

However, we show now that the growth factor obtained after the first step 
of the CALILPRRP factorization is bounded by (1 + rb) H+1 . Consider a row 
j, and let A s (j,b + 1 : n) be the updated row obtained after the first step of 
elimination of CALILPRRP. Suppose that row j of A is a candidate row at 
level k — 1 of the reduction tree, and so it participates in the strong RRQR 
factorization computed at a node Sk at level k of the reduction tree, but it is 
not selected as a candidate row by this factorization. We refer to the matrix 
formed by the candidate rows at node as A k . Hence, row j is not used to form 
the matrix A/.. Similarly, for every node i on the path from node Sk to the root 
of the reduction tree of height H , we refer to the matrix formed by the candidate 
rows selected by strong RRQR as Ai. Note that in practice it can happen that 
one of the blocks of the panel is singular, while the entire panel is nonsingular. 
In this case strong RRQR will select less than 6 linearly independent rows that 
will be passed along the reduction tree. However, for simplicity, we assume in 
the following that the matrices Ai are nonsingular. For a more general solution, 
the reader can consult |10) . 

Let n be the permutation returned by the tournament pivoting strategy 
performed on the first panel, that is the permutation that puts the matrix Ah 
on diagonal. The following equation is satisfied, 

(Ah A h \ _ ( h \ (A H A H \ (Q , 

\A(j,l:b) A(j,b+l:n)J \A(j, 1 : b)A£ l) ' \ A°(j,b + 1 : n) J ' W 

where 



A H = (IL4)(1:6,1:6), 
A H = (IL4)(1 : 6,6 + 1 : n). 

The updated row A s (j, 6+1 : n) can be also obtained by performing LLLPRRP 
on a larger matrix Alu_prrp of dimension {{H— fc + l)6+l) x {{H — fc+ 1)6+1), 



.4 



/ Ah 

A H -1 A H -1 

Ah-2 Ah-2 



LU-PRRP 



V 

(- h - X 

Ah-2A h _ 1 lb 



A H \ 



A k A h 

(-l) H - k A(j,l:b) A(j,b + l:n)J 



V 

(A H 



A H -i 



AkA^ I b 

{-l) H - k A(j,l:b)A^ 1/ 

A H \ 

A H -i 



A! 



Ah-2 

A k A k 

A*(j,b + l:n)J 



(10) 
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where 



A 



H-i = 



A! 



if i = 0, 



-A H -iA H \ i+l A H -i+± XQ <i<H-k. 



(11) 



Equation ( 10 ) can be easily verified, since 



A s (j,b+l:n) = A(j,b + l:n)-(-l) H - k A(j,l:b)A^(-l) H - k A k 

= A(j, b + 1 : n) - A{j, 1 : tyA^AkA^ . . . Ah-2A^_ x Ah-xA^ A h 
= A(j,b + l:n)-A(j, l:b)A- x A H . 



Equations ([9]) and ( 10 ) show that the Schur complement obtained after each 
step of performing the CALLLPRRP factorization of a matrix A is equivalent 
to the Schur complement obtained after performing the LLLPRRP factoriza- 
tion of a larger matrix A^jj _prrp, formed by blocks of A (sometimes slightly 
perturbed) and blocks of zeros. More generally, this implies that the entire 
CALILPRRP factorization of A is equivalent to the LILPRRP factorization of 
a larger and very sparse matrix, formed by blocks of A and blocks of zeros (we 
omit the proofs here, since they are similar with the proofs presented in jlOj). 



Equation (10) is used to derive the upper bound of the growth factor of 
CALILPRRP from the upper bound of the growth factor of LILPRRP. The 
elimination of each row of the first panel using CALILPRRP can be obtained by 
performing LLLPRRP on a matrix of maximum dimension m x b(H + 1). Hence 
the upper bound of the growth factor obtained after one step of CALLLPRRP 
is (1 + rb) H+1 . This leads to an upper bound of (1 + r&)^^ H+1 - )_1 for a matrix 
of size m x n. 

Table [6] summarizes the bounds of the growth factor of CALU_PRRP derived 
in this section, and also recalls the bounds of LILPRRP, GEPP, and CALU 
the communication avoiding version of GEPP. It considers the growth factor 
obtained after the elimination of b columns of a matrix of size m x (6+1), 
and also the general case of a matrix of size m x n. As discussed in section [2] 
already, LLLPRRP is more stable than GEPP in terms of worst case growth 
factor. From Table [6j it can be seen that for a reduction tree of a same height, 
CALU_PRRP is more stable than CALU. 

In the following we show that CALLLPRRP can be more stable than GEPP 
in terms of worst case growth factor. Consider a parallel version of CALLLPRRP 
based on a binary reduction tree of height H — log(P), where P is the number of 

processors. The upper bound of the growth factor becomes (1 + rb) s L, 
which is smaller than 2"( z °f p + 1 )- 1 j the upper bound of the growth factor of 
CALU. For example, if the threshold is r = 2, the panel size is b — 64, and 
the number of processors is P = 128 = 2 7 , then gw calu -PRRP = (1-7)". This 
quantity is much smaller than 2 7n the upper bound of CALU, and even smaller 
than the worst case growth factor of GEPP of 2™ _1 . In general, the upper 
bound of CALU_PRRP can be smaller than the one of GEPP, if the different 
parameters r, H , and b are chosen such that the condition 

H ~ (log 6 + log r) (12) 
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is satisfied. For a binary tree of height H — log P, it becomes log P < (i og j,+i ogT ) ■ 
This is a condition which can be satisfied in practice, by choosing 6 and r ap- 
propriately for a given number of processors P. For example, when P < 512, 
b = 64, and r = 2, the condition ( 12 ) is satisfied, and the worst case growth 
factor of CALILPRRP is smaller than the one of GEPP. 

However, for a sequential version of CALILPRRP using a flat tree of height 

,2 

H = n/b, the condition to be satisfied becomes n < (i og b+iog t) > wm ch is more 
restrictive. In practice, the size of b is chosen depending on the size of the 
memory, and it might be the case that it will not satisfy the condition in equation 
pi. 



Table 6: Bounds for the growth factor gw obtained from factoring a matrix of 
size m x (6+1) and a matrix of size m x n using CALILPRRP, LU-PRRP, 
CALU, and GEPP. CALILPRRP and CALU use a reduction tree of height H. 
The strong RRQR used in LU_PRRP and CALILPRRP is based on a threshold 
t. For the matrix of size m x (6 + 1), the result corresponds to the growth factor 
obtained after eliminating 6 columns. 





TSLU(b,H) 


matrix of size m x 
LU_PRRP(b,H) 


(b+1) 
GEPP 


LU_PRRP 


gw upper bound 




(l + rb) H+1 


2 b 


1 + rb 




CALU 


matrix of size rr 
CALILPRRP 


x n 
GEPP 


LU_PRRP 


gw upper bound 


2n(H+X)-l 


(1 + rb) » 1 


2 n-l 


(1 + t&)t 



3.3 Experimental results 

In this section we present experimental results and show that CALU_PRRP 
is stable in practice and compare them with those obtained from CALU and 
GEPP in [10]. We present results for both the binary tree scheme and the flat 
tree scheme. 

As in section [2j we perform our tests on matrices whose elements follow a 
normal distribution. In Matlab notation, the test matrix is A = randn(rt,n), 
and the right hand side is 6 = randn(n, 1). The size of the matrix is chosen such 
that n is a power of 2, that is n — 2 k , and the sample size is 10 if k < 13 and 3 
if k > 13. To measure the stability of CALU_PRRP, we discuss several metrics, 
that concern the LU decomposition and the linear solver using it, such as the 
growth factor, normwise and componentwise backward errors. We also perform 
tests on several special matrices including sparse matrices, they are described 
in Appendix B. 

Figure [3] displays the values of the growth factor gw of the binary tree based 
CALU_PRRP, for different block sizes 6 and different number of processors P. 
As explained in section |3.1| the block size determines the size of the panel, 
while the number of processors determines the number of block rows in which 
the panel is partitioned. This corresponds to the number of leaves of the binary 
tree. We observe that the growth factor of binary tree based CALU-PRRP is 
in the most of the cases better than GEPP. The curves of the growth factor lie 
between ^n 1 / 2 and In 1 / 2 in our tests on random matrices. These results show 
that binary tree based CALU_PRRP is stable and the growth factor values 
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obtained for the different layouts are better than those obtained with binary 
tree based CALU. The figure [3] includes also the growth factor of the LU-PRRP 
method with a panel of size b — 64. We note that that results are better than 
those of binary tree based CALLLPRRP. 




Figure 3: Growth factor gw of binary tree based CALU_PRRP for random 
matrices. 

Figure [4] displays the values of the growth factor gw for flat tree based 
CALU_PRRP with a block size b varying from 8 to 128. The growth factor gw 
is decreasing with increasing the panel size b. We note that the curves of the 
growth factor lie between \n 1 / 2 and ^n 1 / 2 in our tests on random matrices. We 
also note that the results obtained with the LU_PRRP method with a panel of 
size b = 64 are better than those of flat tree based CALILPRRP. 

The growth factors of both binary tree based and flat tree based C ALU_PRRP 
have similar (sometimes better) behavior than the growth factors of GEPP. 

Table [19] in Appendix B presents results for the linear solver using binary 
tree based CALILPRRP, together with binary tree based CALU and GEPP for 
comparaison and Table [20| in Appendix B presents results for the linear solver 
using flat tree based CALU_PRRP, together with flat tree based CALU and 
GEPP for comparaison. We note that for the binary tree based CALU_PRRP, 
when ?p = b, for the algorithm we only use PI = 75^53 processors, since to 
perform a Strong RRQR on a given block, the number of its rows should be 
at least the number of its columns +1. Tables [T9l and [20l also include results 
obtained by iterative refinement used to improve the accuracy of the solution. 
For this, the componentwise backward error in equation pi is used. In the 
previous tables, Wb denotes the componentwise backward error before iterative 
refinement and Nm denotes the number of steps of iterative refinement. Nm is 
not always an integer since it represents an average. For all the matrices tested 
CALU_PRRP leads to results as accurate as the results obtained with CALU 
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and GEPP. 

In Appendix B we present more detailed results. There we include some 
other metrics, such as the norm of the factors, the norm of the inverse of the fac- 
tors, their conditioning, the value of their maximum element, and the backward 
error of the LU factorization. Through the results detailed in this section and in 
Appendix B we show that binary tree based and flat tree based CALU-PRRP 
are stable, have the same behavior as GEPP for random matrices, and are more 
stable than binary tree based and flat tree based CALU in terms of growth 
factor. 

Figure [5] summarizes all our stability results for the CALILPRRP factoriza- 
tion based on both binary tree and flat tree schemes. As figure [2j this figure 
displays the ratio of the maximum between the backward error and machine cp- 
silon of LU.PRRP versus GEPP. The backward error is measured as the relative 
error ||Pyl — iJ7||/||A||, the normwise backward error 77, and the componentwise 
backward error w. Results for all the matrices in our test set are presented, 
that is 25 random matrices for binary tree base CALU-PRRP from Table [23} 20 
random matrices for flat tree based CALLLPRRP from Table [21] and 37 special 
matrices from Tables [22] and [24] As it can be seen, nearly all ratios are between 
0.5 and 2.5 for random matrices. However there are few outliers, for example 
the relative error ratio has values between 24.2 for the special matrix hadamard 
(GEPP is more stable than binary tree based CALILPRRP), and 5.8 x 10~ 3 for 
the special matrix moler (binary tree based CALLLPRRP is more stable than 
GEPP). 

We consider now the same set of pathological matrices as in section [2] on 
which GEPP fails. For the Wilkinson matrix, both CALU and CALU_PRRP 
based on flat and binary tree give modest element growth. For the generalized 
Wilkinson matrix, the Foster matrix, and Wright matrix, CALU fails with both 
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Figure 5: A summary of all our experimental data, showing the ratio of 
max(CALU_PRRP's backward error, machine epsilon) to max(GEPP's back- 
ward error, machine epsilon) for all the test matrices in our set. Each vertical 
bar represents such a ratio for one test matrix. Bars above 10° = 1 mean that 
CALlLPRRP's backward error is larger, and bars below 1 mean that GEPP's 
backward error is larger. For each matrix and algorithm, the backward error 
is measured using three different metrics. For the last third of the bars, la- 
beled "componentwise backward error", the metric is w in equation (J8|. The 
test matrices are further labeled either as "randn" , which are randomly gener- 
ated, or "special" , listed in Table [15] Finally, each test matrix is factored using 
CALU J>RRP with a binary reduction tree (labeled BCALU for BCALILPRRP) 
and with a flat reduction tree (labeled FCALU for FCALU_PRRP). 



flat tree and binary tree reduction schemes. 

Tables [7] and [8] present the results obtained for the linear solver using the 
CALU_PRRP factorization based on flat and binary tree schemes for a gener- 
alized Wilkinson matrix of size 2048 with a size of the panel varying from 8 to 
128 for the flat tree scheme and a number of processors varying from 128 to 32 
for the binary tree scheme. The growth factor is of order 1 and the quantity 
^ Fj |l^n * S 0n ^ e orc ^ er °f 10 16 - Both flat tree based CALU and binary tree 
based CALU fail on this pathological matrix. In fact for a generalized Wilkinson 
matrix of size 1024 and a panel of size b = 128, the growth factor obtained with 
flat tree based CALU is of size 10 234 . 



Table 7: Stability of the flat tree based CALUJPRRP factorization of a gener- 
alizcd Wilkinson matrix on which GEPP fails. 



n 


b 


9w 


\\U\\i 


ll^lli 


Plli 


ll^lli 


\\PA-LU\\ F 
\\A\\ F 


2048 


128 


2.01 


l.Olc+03 


1.40c+02 


1.31c+03 


9.76e+02 


9.56e-16 


64 


2.02 


1.18e+03 


1.64e+02 


1.27e+03 


1.16e+03 


1.01e-15 


32 


2.04 


8.34e+02 


1.60c+02 


1.30c+03 


7.44c+02 


7.91e-16 


16 


2.15 


9.10e+02 


1.45c+02 


1.31e+03 


8.22c+02 


8.07e-16 


8 


2.15 


8.71e+02 


1.57c+02 


1.371c+03 


5.46c+02 


6.09e-16 



For the Foster matrix, we have seen in the section [2] that LU_PRRP gives 
modest pivot growth, whereas GEPP fails. Both flat tree based CALU and 
binary tree based CALU fail on the Foster matrix. However flat tree based 
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Table 8: Stability of the binary tree based CALU-PRRP factorization of a 
generalized Wilkinson matrix on which GEPP fails. 



n 


P 


b 


9w 


nil 


ll^lli 


\\L\\i 


ll^lli 


\\PA-LU\\ F 
\\M\ F 




128 


8 


2.10e+00 


1.33c+03 


1.29c+02 


1.34e+03 


1.33e+03 


1.08e-15 




64 


16 


2.04e+00 


6.85e+02 


1.30e+02 


1.30e+03 


6.85e+02 


7.85e-16 


2048 


8 


8.78e+01 


1.21c+03 


1.60c+02 


1.33e+03 


l.Olc+03 


9.54e-16 






32 


2.08e+00 


9.47e+02 


1.58e+02 


1.41e+03 


9.36e+02 


5.95e-16 




32 


16 


2.08e+00 


1.24c+03 


1.32e+02 


1.35e+03 


1.24e+03 


1.01e-15 






8 


1.45c+02 


1.03c+03 


1.54c+02 


1.37e+03 


6.61c+02 


6.91e-16 



CALU-PRRP and binary tree based CALU-PRRP solve easily this pathological 
matrix. 

Tables [9] and [10] present results for the linear solver using the C ALU-PRRP 
factorization based on the flat tree scheme and the binary tree scheme, respec- 
tively. We test a Foster matrix of size 2048 with a panel size varying from 8 
to 128 for the fiat tree based C ALU-PRRP and a number of processors vary- 
ing from 128 to 32 for the binary tree based CALU_PRRP. We use the same 
parameters as in sectional that is c = 1, h = 1 and k = |. According to the 
obtained results, CALU_PRRP gives a modest growth factor of 1.33 for this 
practical matrix. 



Table 9: Stability of the flat tree based CALU_PRRP factorization of a practical 
matrix (Foster) on which GEPP fails. 



n 


b 


9w 


\\u\h 


llt^Hi 


\\L\h 


ll^lli 


\\PA-LU\\ F 


2048 


128 


1.33 


1.71c+02 


1.87c+00 


1.92c+03 


1.29c+02 


6.51c-17 


64 


1.33 


8.60e+01 


1.87c+00 


1.98c+03 


6.50e+01 


4.87e-17 


32 


1.33 


4.33c+01 


1.87c+00 


2.01c+03 


3.30c+01 


2.91c-17 


16 


1.33 


2.20c+01 


1.87c+00 


2.03c+03 


1.70c+01 


4.80e-17 


8 


1.33 


1.13c+01 


1.87c+00 


2.04c+03 


9.00e+00 


6.07e-17 



Table 10: Stability of the binary tree based CALU_PRRP factorization of a 
practical matrix (Foster) on which GEPP fails. 



n 


P 


b 


9w 


\\U\\x 


ll^lli 


llilli 


lli-'lli 


\\PA-LU\\ F 
\\A\\f 


2048 


128 


8 


1.33 


1.13c+01 


1.87c+00 


2.04c+03 


9.00c+00 


6.07c-17 


64 


16 


1.33 


2.20c+01 


1.87c+00 


2.03c+03 


lJOc+01 


4.80c-17 


8 


1.33 


1.13c+01 


1.87e+00 


2.04e+03 


9.00e+00 


6.07c-17 


32 


32 


1.33 


4.33c+01 


1.87e+00 


2.01c+03 


3.300c+01 


2.91c-17 


16 


1.33 


2.20e+01 


1.87e+00 


2.03c+03 


1.70e+01 


4.80e-17 


8 


1.33 


1.13e+01 


1.87e+00 


2.04e+03 


9.00e+00 


6.07e-17 



As GEPP, both the flat tree based and the binary tree based CALU fail on 
the Wright matrix. In fact for a matrix of size 2048, a parameter h = 0.3, with 
a panel of size b = 128, the flat tree based CALU gives a growth factor of 10 98 . 
With a number of processors P — 64 and a panel of size b = 16, the binary 
tree based CALU also gives a growth factor of 10 98 . Tables 11 and 12 present 
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results for the linear solver using the CALLLPRRP factorization for a Wright 
matrix of size 2048. For the flat tree based CALU-PRRP, the size of the panel 
is varying from 8 to 128. For the binary tree based CALLLPRRP, the number 
of processors is varying from 32 to 128 and the size of the panel from 16 to 
64 such that the number of rows in the leaf nodes is equal or bigger than two 
times the size of the panel. The obtained results, show that CALU_PRRP gives 
a modest growth factor of 1 for this practical matrix, compared to the CALU 
method. 



Table 11: Stability of the flat tree based CALU-PRRP factorization of a prac- 
tical matrix (Wright) on which GEPP fails. 



n 


b 


gw 


\\U\\i 


ll^lli 


IWIi 


ll^lli 


\\PA-LU\\ F 
\\A\\f 




128 


l 


3.25c+00 


8.00c+00 


2.00c+00 


2.00c+00 


4.08e-17 




64 


l 


3.25c+00 


8.00c+00 


2.00c+00 


2.00e+00 


4.08c-17 


2048 


32 


l 


3.25c+00 


8.00c+00 


2.05c+00 


2.02e+00 


6.65e-17 




16 


l 


3.25e+00 


8.00e+00 


2.32e+00 


2.18e+00 


1.04e-16 




8 


l 


3.40c+00 


8.00c+00 


2.62c+00 


2.47e+00 


1.26e-16 



Table 12: Stability of the binary tree based CALLLPRRP factorization of a 
practical matrix (Wright) on which GEPP fails. 



n 


P 


b 


gw 


\W\\i 


lie/" 1 !!! 


ll^lli 


ll^lli 


\\PA-LU\\ F 
\\A\\f 




128 


8 


l 


3.40e+00 


8.00c+00 


2.62c+00 


2.47c+00 


1.26c-16 




64 


16 


l 


3.25c+00 


8.00c+00 


2.32c+00 


2.18c+00 


1.04e-16 


2048 


8 


l 


3.40e+00 


8.00e+00 


2.62e+00 


2.47e+00 


1.26e-16 






32 


l 


3.25e+00 


8.00c+00 


2.05c+00 


2.02c+00 


6.65e-17 




32 


16 


l 


3.25e+00 


8.00e+00 


2.32c+00 


2.18c+00 


1.04c-16 






8 


l 


3.40e+00 


8.00c+00 


2.62c+00 


2.47c+00 


1.26c-16 



All the previous tests show that the CALLLPRRP factorization is very stable 
for random and more special matrices, and it also gives modest growth factor 
for the pathological matrices on which CALU fails, this is for both binary tree 
and flat tree based CALU_PRRP. 



4 Lower bounds on communication 

In this section we focus on the parallel CALU_PRRP algorithm based on a bi- 
nary reduction tree, and we show that it minimizes the communication between 
different processors of a parallel computer. For this, we use known lower bounds 
on the communication performed during the LU factorization of a dense matrix 
of size n x n, which are 



# words_movcd = - , (13) 

1 Im ' 



r? 



# messages = tl ( — ^- ) , (14) 
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where # words -moved refers to the volume of communication, # messages refers 
to the number of messages exchanged, and M refers to the size of the memory 
(the fast memory in the case of a sequential algorithm, or the memory per 
processor in the case of a parallel algorithm). These lower bounds were first 
introduced for dense matrix multiplication |15| . [16] . generalized later to LU 
factorization [3], and then to almost all direct linear algebra pQ. Note that 
these lower bounds apply to algorithms based on orthogonal transformations 
under certain conditions 1 . However, this is not relevant to our case, since 
CALILPRRP uses orthogonal transformations only to select pivot rows, while 
the update of the trailing matrix is still performed as in the classic LU factor- 



ization algorithm. Hence the lower bounds from equations (13) and, (14) are 
valid for CALLLPRRP. 

We estimate now the cost of computing in parallel the CALILPRRP fac- 
torization of a matrix A of size m x n. The matrix is distributed on a grid of 
P = P r x P c processors using a two-dimensional (2D) block cyclic layout. We 
use the following performance model. Let 7 be the cost of performing a floating 
point operation, and let a + /3w be the cost of sending a message of size w words, 
where a is the latency cost and /3 is the inverse of the bandwidth. Then, the 
total running time of an algorithm is estimated to be 

a ■ (# messages) + /?•(# words_moved) + 7 • (# flops), 

where ^messages, #words_moved, and #flops are counted along the critical 
path of the algorithm. 



Table 13 displays the performance of parallel CALU-PRRP (a detailed es- 
timation of the counts is presented in Appendix D). It also recalls the perfor- 
mance of two existing algorithms, the PDGETRF routine from ScaLAPACK 
which implements GEPP, and the CALU factorization. All three algorithms 
have the same volume of communication, since it is known that PDGETRF 
already minimizes the volume of communication. However, the number of mes- 
sages of both CALILPRRP and CALU is smaller by a factor of the order of 
b than the number of messages of PDGETRF. This improvement is achieved 
thanks to tournament pivoting. In fact, partial pivoting, as used in the routine 
PDGETRF, leads to an 0(n\ogP) number of messages, and because of this, 
GEPP cannot minimize the number of messages. 

Compared to CALU, CALU_PRRP sends a small factor of less messages 
(depending on P r and P c ) and performs -p- (2mn — n 2 ) b + n |~(51og 2 P r + 1) 
more flops (which represents a lower order term). This is because CALU_PRRP 
uses the strong RRQR factorization at every node of the reduction tree of every 
panel factorization, while CALU uses GEPP. 

Despite this additional communication cost, we show now that CALU_PRRP 
is optimal in terms of communication. We choose optimal values of the param- 
eters P r , P c , and b, as used in CAQR [4] and CALU [10], that is, 

„ mP „ nP , , 1 , 9 / mP \ Iran , 9 / mP 
Pr = \ ,P C = J— and b = - log- 2 [J .</— = log- 2 

For a square matrix of size n x n, the optimal parameters are, 

P r = VP , P C = VP and b = - log" 2 (Vp) ■ = log" 2 (P) • -^L. 

4 V / yP VP 
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Table 13: Performance estimation of parallel (binary tree based) CALU_PRRP, 
parallel CALU, and PDGETRF routine when factoring anmXn matrix, m> n. 
The input matrix is distributed using a 2D block cyclic layout on a P r x P c grid 
of processors. Some lower order terms are omitted. 



Parallel CALILPRRP 



# messages 

# words 

# flops 



IT !°g2 Pr+f l0g 2 Pc 

nb +lpl) lo e2Pr + jr(r 

n2 - if) + w ( 2mn - 



log 2 P c 



log 2 P r 



Parallel CALU 



j log 2 P r + f log 2 P c 
(n6+|^)log 2 P r + i 



# messages 

# words 

# flops 



■>p,. 

2 



+ 



2mn - 



- \) log 2 P c 



2P C 



+ ^(51og 2 P r 



# messages 

# words 

# flops 



PDGETRF 

2n (1 + f) log 2 P r + f log 2 P c 

(f + l^) l0 g2^+l0g 2 P c ^ 



M- 2 -t 



^ 2P C 



Table [14] presents the performance estimation of parallel C ALU_PRRP and 
parallel CALU when using the optimal layout. It also recalls the lower bounds 
on communication from equations ( 13 ) and ( 14 1 when the size of the memory 
per processor is on the order of n 2 /P. Both CALU-PRRP and CALU attain 
the lower bounds on the number of words and on the number of messages, 
modulo polylogarithmic factors. Note that the optimal layout allows to reduce 
communication, while keeping the number of extra floating point operations 
performed due to tournament pivoting as a lower order term. While in this 
section we focused on minimizing communication between the processors of a 
parallel computer, it is straightforward to show that the usage of a flat tree 
during tournament pivoting allows CALU_PRRP to minimize communication 
between different levels of the memory hierarchy of a sequential computer. 



Table 14: Performance estimation of parallel (binary tree based) CALU_PRRP 
and CALU with an optimal layout. The matrix factored is of size n x n. Some 
lower-order terms are omitted. 





Parallel CALU.PRRP with optimal layout 


Lower bound 


# messages 

# words 

# flops 


IVFiog 3 P 

^i(ll g-lp + l 0g p) 

1 2n 3 , 5n 3 , 5n 3 
P 3 ' 2Plo K 2 P 1 3Plo K 3 P 


n(^p) 

1 2n 3 
P 3 




Parallel CALU with optimal layout 


Lower bound 


# messages 

# words 

# flops 


3\/Plog 3 P 

^(llog^P + logP) 

1 2n 3 | 3n 3 . 5n 3 
P 3 ' 2Plog 2 P 1 CP log 3 P 


fi(v / P) 

°(#) 

1 2n 3 
P 3 
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5 Less stable factorizations that can also mini- 
mize communication 

In this section, we present briefly two alternative algorithms that are based 
on panel strong RRQR pivoting and that are conceived such that they can 
minimize communication. But we will see that they can be unstable in practice. 
These algorithms are also based on block algorithms, that factor the input 
matrix by traversing panels of size b. The main difference between them and 
CALILPRRP is the panel factorization, which is performed only once in the 
alternative algorithms. 

We present first a parallel alternative algorithm, which we refer to as block 
parallel LILPRRP. At each step of the block factorization, the panel is parti- 
tioned into P block-rows [Aq ; Ax', ■ ■ ■ ; Ap- 1] . The blocks below the diagonal bxb 
block of the current panel are eliminated by performing a binary tree of strong 
RRQR factorizations. At the leaves of the tree, the elements below the diagonal 
block of each block A; are eliminated using strong RRQR. The elimination of 
each such block row is followed by the update of the corresponding block row of 
the trailing matrix. The algorithm continues by performing the strong RRQR 
factorization of pairs oibxb blocks stacked atop one another, until all the blocks 
below the diagonal block are eliminated and the corresponding trailing matrices 
are updated. The algebra of the block parallel LILPRRP algorithm is detailed 
in Appendix E, while in figure [6] we illustrate one step of the factorization by 
using an arrow notation, where the function <?(Ay) computes a strong RRQR 
on the matrix Ajj and updates the trailing matrix in the same step. 

g(A 10 )/ \ 
9{A 3 o) / 



Figure 6: Block parallel LU-PRRP 



A sequential version of the algorithm is based on the usage of a flat tree, 
and we refer to this algorithm as block pairwise LU_PRRP. Using the arrow 
notation, the figure [7] illustrates the elimination of one panel. 



-420 
-430 




Figure 7: Block pairwise LU.PRRP 



The block parallel LU_PRRP and the block pairwise LU_PRRP algorithms 
have similarities with the block parallel pivoting and the block pairwise pivoting 
algorithms. These two latter algorithms were shown to be potentially unstable 
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in |10j . There is a main difference between all these alternative algorithms and 
algorithms that compute a classic LU factorization as GEPP, LU-PRRP, and 
their communication avoiding variants. The alternative algorithms compute a 
factorization in the form of a product of lower triangular factors and an upper 
triangular factor. And the elimination of each column leads to a rank update of 
the trailing matrix larger than one. It is thought in [20 that the rank-1 update 
property of algorithms that compute an LU factorization inhibits potential ele- 
ment growth during the factorization, while a large rank update might lead to 
an unstable factorization. 

Note however that at each step of the factorization, block parallel and block 
pairwise LU_PRRP use at each level of the reduction tree original rows of the 
active matrix. Block parallel pivoting and block pairwise pivoting algorithms 
use U factors previously computed to achieve the factorization, and this could 
potentially lead to a faster propagation of ill-conditioning. 




Figure 8: Growth factor of block parallel LU_PRRP for varying block size b 
and number of processors P. 

The upper bound of the growth factor of both block parallel and block pair- 
wise LU_PRRP is (1 + t6)t , since for every panel factorization, a row is updated 
only once. Hence they have the same bounds as the LU_PRRP factorization, 
and smaller than that of the CALU_PRRP factorization. Despite this, they 
are less stable than the CALU_PRRP factorization. Figures [8] and [9] display 
the growth factor of block parallel LU_PRRP and block pairwise LU_PRRP for 
matrices following a normal distribution. In figure [8j the number of processors 
P on which each panel is partitioned is varying from 16 to 32, and the block size 
b is varying from 2 to 16. The matrix size varies from 64 to 2048, but we have 
observed the same behavior for matrices of size up to 8192. When the number 
of processors P is equal to 1, the block parallel LU_PRRP corresponds to the 
LU-PRRP factorization. The results show that there are values of P and b for 
which this method can be very unstable. For the sizes of matrices tested, when 
b is chosen such that the blocks at the leaves of the reduction tree have more 
than 2b rows, the number of processors P has an important impact, the growth 
factor increases with increasing P, and the method is unstable. 

In Figure [9j the matrix size varies from 1024 to 8192. For a given matrix 
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size, the growth factor increases with decreasing the size of the panel b, as one 
could expect. We note that the growth factor of block pairwise LU-PRRP is 
larger than that obtained with the CALU_PRRP factorization based on a flat 
tree scheme presented in Table [21] But it stays smaller than the size of the 
matrix n for different panel sizes. Hence this method is more stable than block 
parallel LU_PRRP. Further investigation is required to conclude on the stability 
of these methods. 




1024 2046 4096 8192 

matrix sire 



Figure 9: Growth factor of block pairwise LILPRRP for varying matrix size 
and varying block size b. 
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6 Conclusions 

This paper introduces LILPRRP, an LU factorization algorithm based on panel 
rank revealing pivoting. This algorithm is more stable than GEPP in terms of 
worst case growth factor. It is also very stable in practice for various classes of 
matrices, including pathological cases on which GEPP fails. 

Its communication avoiding version, CALLLPRRP, is also more stable in 
terms of worst case growth factor than CALU, the communication avoiding 
version of GEPP. More importantly, there are cases of interest for which the 
upper bound of the growth factor of CALU_PRRP is smaller than that of GEPP 
for several cases of interest. Extensive experiments show that CALLLPRRP is 
very stable in practice and leads to results of the same order of magnitude as 
GEPP, sometimes even better. 

Our future work focuses on two main directions. The first direction investi- 
gates the design of a communication avoiding algorithm that has smaller bounds 
on the growth factor than that of GEPP in general. The second direction focuses 
on estimating the performance of CALL_PRRP on parallel machines based on 
multicore processors, and comparing it with the performance of CALL. 
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Appendix A 

We describe briefly strong RRQR introduced by M. Gu and S. Eisenstat in [12]. 
This factorization will be used in our new LU decomposition algorithm, which 
aims to obtain an upper bound of the growth factor smaller than GEPP (see 
Section[2]) Consider a given threshold r > 1 and an/ixp matrix B with p > h, 
a Strong RRQR factorization on a matrix B gives (with an empty (2, 2) block) 

B T H = QR = Q[ R n R 12 ] , 

where Ilii^-K^ || maz < T : with |j . \\ max being the biggest entry of a given ma- 
trix in absolute value. This factorization can be computed by a classical QR 
factorization with column pivoting followed by a limited number of additional 
swaps and QR updates if necessary. 



Algorithm 2 Strong RRQR 
1: Compute B T H = QR using the classical RRQR with column pivoting 
2: while there exist i and j such that |(i?xi -^12)*} I > T do 
3: Set II = nn.y and compute the QR factorization of R Ily (QR updates) 
4: end while 

Ensure: B T Ii = QR with H-R^-Riallmaa < r 



The while loop in Algorithm [2] interchanges any pairs of columns that can 
increase \det(Rn)\ by at least a factor r. At most 0(log T n) such interchanges 
are necessary before Algorithm [2] finds a strong RRQR factorization. The QR 
factorization of B T H can be computed numerically via efficient and numerically 
stable QR updating procedures. See [12] for details. 



Appendix B 



We present experimental results for the LU-PRRP factorization, the binary tree 
based CALILPRRP, and the flat tree based CALILPRRP. We show results 
obtained for the LU decomposition and the linear solver. Tables [T6] [21] and [23] 
display the results obtained for random matrices. They show the growth factor, 
the norm of the factor L and U and their inverses, and the relative error of the 
decomposition. 

Tables [l7j [18] [22] and [24] display the results obtained for the special matrices 
presented in Table[T5] The size of the tested matrices is n = 4096. For LU_PRRP 
and flat tree based CALU_PRRP, the size of the panel is b = 8. For binary tree 
based CALU_PRRP we use P = 64 and b — 8, this means that the size of the 
matrices used at the leaves of the reduction tree is 64 x 8. 
Tables |T9] and [20] present results for the linear solver using binary tree based and 
flat tree based CALU_PRRP, together with CALU and GEPP for comparison. 
The tables are presented in the following order. 



Table 16 Stability of the LU decomposition for LU_PRRP and GEPP on 



random matrices. 

Table [17] Stability of the LU decomposition for GEPP on special matrices. 
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• Table [TSj Stability of the LU decomposition for LU_PRRP on special 
matrices. 

• Table [19} Stability of the linear solver using binary tree based CALILPRRP, 
binary tree based CALU, and GEPP. 

• Table [20] Stability of the linear solver using flat tree based CALU_PRRP, 
fiat tree based CALU, and GEPP. 

• Table|2lj Stability of the LU decomposition for flat tree based CALU_PRRP 
and GEPP on random matrices. 

• Table[22} Stability of the LU decomposition for fiat tree based CALU_PRRP 
on special matrices. 

• Table [23} Stability of the LU decomposition for binary tree based CALU_PRRP 
and GEPP on random matrices. 



Table 24 Stability of the LU decomposition for binary tree based CALU_PRRP 
on special matrices. 



Table 15: Special matrices in our test set. 



No. 

1 

2 
3 
4 
5 



9 
10 



Matrix 

hadamard 

house 
parter 



kms 

toeppen 

condex 

moler 

circul 
randcorr 



11 poisson 

12 hankel 

13 jordbloc 

14 compan 

15 pei 

16 randcolu 

17 sprandn 

18 riemann 



Remarks 

Hadamard matrix, hadamard(n), where n, n/12, or n/20 is power of 
2. 

Householder matrix, A = eye(n) — f3 * v * v' , where [v,f), s] = 
gallery ('house', randn(n, 1)). 

Parter matrix, a Toeplitz matrix with most of singular values near n. 
gallery('parter', n), or A(i,j) — — j + 0.5). 

Ris matrix, matrix with elements A(i,j) = 0.5/(n — i — j + 1.5). The 
eigenvalues cluster around ~n/2 and it/2, gallery ('ris', n). 
Kac-Murdock-Szego Toeplitz matrix. Its inverse is tridiagonal. 
gallery('kms', n) or gallery ('kms', n, rand). 
Pentadiagonal Toeplitz matrix (sparse). 

Counter-example matrix to condition estimators, gallery ('condex', n). 
Moler matrix, a symmetric positive definite (spd) matrix, 
gallery ('moler', n). 

Circulant matrix, gallery ('circul', randn(n, 1)). 

Random n x n correlation matrix with random eigenvalues from 
a uniform distribution, a symmetric positive semi-definite matrix. 
gallery('randcorr', n). 

Block tridiagonal matrix from Poisson's equation (sparse), A = 
gallery ( 'poisson' ,sqrt (n) ) . 

Hankel matrix, A = hankel(c, r), where c=randn(n, 1), r=randn(n, 1), 

and c(n) = r(l). 

Jordan block matrix (sparse). 

Companion matrix (sparse), A — compan(randn(n+l,l)). 

Pei matrix, a symmetric matrix. gallery('pei', n) or gallery('pei', n, 

randn) . 

Random matrix with normalized cols and specified singular values. 
gallery('randcolu', n). 

Sparse normally distributed random matrix, A — sprandn(n, n,0.02). 
Matrix associated with the Riemann hypothesis, gallery ('riemann', n). 
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19 compar Comparison matrix, gallery('compar', randn(n), unidrnd(2) — 1). 

20 tridiag Tridiagonal matrix (sparse). 

21 chebspec Chebyshev spectral differentiation matrix, gallery('chebspec', n, 1). 

22 lehmer Lehmer matrix, a symmetric positive definite matrix such that 

A(i,j) = i/j for j > i. Its inverse is tridiagonal. gallery('lehmer', 
n). 

23 toeppd Symmetric positive semi-definite Toeplitz matrix, gallery ( 'toeppd', n). 

24 minij Symmetric positive definite matrix with A(i, j) = min(i,j). 

gallery('minij', n). 

25 randsvd Random matrix with preassigned singular values and specified band- 

width, gallery ('randsvd', n). 

26 forsythe Forsythe matrix, a perturbed Jordan block matrix (sparse). 

27 fiedler Fiedler matrix, gallery('fiedler', n), or gallery ('fiedler', randn(n, 1)). 

28 dorr Dorr matrix, a diagonally dominant, ill-conditioned, tridiagonal matrix 

(sparse). 

29 demmel A = D*(eye(n) + 10 _7 *rand(n)), where D = diag(10 14 * (0: ™- 1)/ ") 0. 

30 chebvand Chebyshev Vandermonde matrix based on n equally spaced points on 

the interval [0, 1]. gallery ('chebvand', n). 

31 invhess A=gallery('invhess', n, rand(n— 1, 1)). Its inverse is an upper Hessen- 

berg matrix. 

32 prolate Prolate matrix, a spd ill-conditioned Toeplitz matrix, gallery ('prolate', 

n). 

33 frank Frank matrix, an upper Hessenberg matrix with ill-conditioned eigen- 

values. 

34 cauchy Cauchy matrix, gallery ('cauchy', randn(n, 1), randn(n, 1)). 

35 hilb Hilbert matrix with elements + j — 1). A =hilb(n). 

36 lotkin Lotkin matrix, the Hilbert matrix with its first row altered to all ones. 

gallery('lotkin', n). 

37 kahan Kahan matrix, an upper trapezoidal matrix. 
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Table 19: Stability of the linear solver using binary tree based CALU_PRRP, 
binary tree based CALU, and GEPP. 



n 


P 


b 


V 


w b 


N IR 


HPL1 


HPL2 


HPL3 


Binary tree based CALLLPRRP 


8192 


ZOO 


32 


7.5E-15 


4.4E-14 


2 


6.2E-02 


2.3E-02 


4.4E-03 


16 


6.7E-15 


4.1E-14 


2 


3.4E-02 


2.2E-02 


4.6E-03 


128 


64 


7.6E-15 


4.7E-14 


2 


2.6E-02 


2.5E-02 


4.9E-03 


32 


7.5E-15 


4.9E-14 


2 


6.9E-02 


2.6E-02 


4.9E-03 


16 


7.3E-15 


5.1E-14 


2 


4.2E-02 


2.7E-02 


5.7E-03 


64 


128 


7.6E-15 


5.0E-14 


2 


2.8E-02 


2.6E-02 


5.2E-03 


64 


7.9E-15 


5.3E-14 


2 


6.2E-02 


2.8E-02 


5.9E-03 


32 


7.8E-15 


5.0E-14 


2 


7.0E-02 


2.6E-02 


5.0E-03 


16 


6.7E-15 


5.0E-14 


2 


4.2E-02 


2.7E-02 


5.7E-03 


4096 


256 


16 


3.5E-15 


2.2E-14 


2 


5.8E-02 


2.3E-02 


5.1E-03 


128 


32 


3.8E-15 


2.3E-14 


2 


5.3E-02 


2.4E-02 


5.1E-03 


16 


3.6E-15 


2.2E-14 


1.6 


1.3E-02 


2.4E-02 


5.1E-03 


64 


64 


4.0E-15 


2.3E-14 


2 


3.8E-02 


2.4E-02 


4.9E-03 


32 


3.9E-15 


2.4E-14 


2 


1.2 E-02 


2.5E-02 


5.7E-03 


16 


3.8E-15 


2.4E-14 


1.6 


2.3E-02 


2.5E-02 


5.2E-03 


2048 


128 


16 


1.8E-15 


1.0E-14 


2 


8.1E-02 


2.2E-02 


4.8E-03 


64 


32 


1.8E-15 


1.2E-14 


2 


9.7E-02 


2.5E-02 


5.6E-03 


16 


1.9E-15 


1.2E-14 


1.8 


3.4E-02 


2.5E-02 


5.4E-03 


1024 


64 


16 


1.0E-15 


6.3E-15 


1.3 


7.2E-02 


2.5E-02 


6.1E-03 


Binary tree based CALU 


8192 


256 


32 


6.2E-15 


4.1E-14 


2 


3.6E-02 


2.2E-02 


4.5E-03 


16 


5.8E-15 


3.9E-14 


2 


4.5E-02 


2.1E-02 


4.1E-03 


128 


64 


6.1E-15 


4.2E-14 


2 


5.0E-02 


2.2E-02 


4.6E-03 


32 


6.3E-15 


4.0E-14 


2 


2.5E-02 


2.1E-02 


4.4E-03 


16 


5.8E-15 


4.0E-14 


2 


3.8E-02 


2.1E-02 


4.3E-03 


64 


128 


5.8E-15 


3.6E-14 


2 


8.3E-02 


1.9E-02 


3.9E-03 


64 


6.2E-15 


4.3E-14 


2 


3.2E-02 


2.3E-02 


4.4E-03 


32 


6.3E-15 


4.1E-14 


2 


4.4E-02 


2.2E-02 


4.5E-03 


16 


6.0E-15 


4.1E-14 


2 


3.4E-02 


2.2E-02 


4.2E-03 


4096 


256 


16 


3.1E-15 


2.1E-14 


1.7 


3.0E-02 


2.2E-02 


4.4E-03 


128 


32 


3.2E-15 


2.3E-14 


2 


3.7E-02 


2.4E-02 


5.1E-03 


16 


3.1E-15 


1.8E-14 


2 


5.8E-02 


1.9E-02 


4.0E-03 


64 


64 


3.2E-15 


2.1E-14 


1.7 


3.1E-02 


2.2E-02 


4.6E-03 


32 


3.2E-15 


2.2E-14 


1.3 


3.6E-02 


2.3E-02 


4.7E-03 


16 


3.1E-15 


2.0E-14 


2 


9.4E-02 


2.1E-02 


4.3E-03 


2048 


128 


16 


1.7E-15 


1.1E-14 


1.8 


6.9E-02 


2.3E-02 


5.1E-03 


64 


32 


1.7E-15 


1.0E-14 


1.6 


6.5E-02 


2.1E-02 


4.6E-03 


16 


1.6E-15 


1.1E-14 


1.8 


4.7E-02 


2.2E-02 


4.9E-03 


1024 


64 


16 


8.7E-16 


5.2E-15 


1.6 


1.2E-1 


2.1E-02 


4.7E-03 


GEPP 


8192 




3.9E-15 


2.6E-14 


1.6 


1.3E-02 


1.4E-02 


2.8E-03 


4096 




2.1E-15 


1.4E-14 


1.6 


1.8E-02 


1.4E-02 


2.9E-03 


2048 




1.1E-15 


7.4E-15 


2 


2.9E-02 


1.5E-02 


3.4E-03 


1024 




6.6E-16 


4.0E-15 


2 


5.8E-02 


1.6E-02 


3.7E-03 
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Table 20: Stability of the linear solver using flat tree based CALILPRRP, flat 



tree based CALU, and GEPP. 



n 


P 


b 


V 


w b 


N IR 


HPL1 


HPL2 


HPL3 


Flat tree based CALU_PRRP 


8096 




8 


6.2E-15 


3.8E-14 




1.7E-02 


2.0E-02 


4.0E-03 




16 


6.6E-15 


4.3E-14 




3.6E-02 


2.2E-02 


4.3E-03 




32 


7.2E-15 


4.8E-14 




6.5E-02 


2.5E-02 


5.3E-03 




64 


7.3E-15 


5.1E-14 




4.8E-02 


2.7E-02 


5.4E-03 


4096 




8 


3.2E-15 


2.0E-14 




6.7E-02 


2.1E-02 


4.7E-03 




16 


3.6E-15 


2.3E-14 




2.9E-02 


2.4E-02 


5.1E-03 




32 


3.8E-15 


2.4E-14 




4.6E-02 


2.5E-02 


5.5E-03 




64 


3.7E-15 


2.5E-14 




1.7E-1 


2.6E-02 


5.6E-03 


2048 




8 


1.4E-15 


1.1E-14 




1.3E-1 


2.3E-02 


5.1E-03 




16 


1.9E-15 


1.1E-14 




1.6E+0 


2.3 E-02 


5.3E-03 




32 


2.1E-15 


1.3E-14 




2.5E-02 


2.7E-02 


5.8E-03 




64 


1.9E-15 


1.25E-14 




1.2E-1 


2.6E-02 


5.9E-03 


1024 




8 


9.4E-16 


5.5E-15 




3.8E-02 


2.2E-02 


5.3E-03 




16 


1.0E-15 


6.0E-15 




6.2E-02 


2.4E-02 


5.4E-03 




32 


1.0E-15 


5.6E-15 




4.2E-02 


2.2E-02 


5.4E-03 




64 


1.0E-15 


6.8E-15 




3.6E-02 


2.7E-02 


6.7E-03 




Flat tree based CALU 


8096 




8 


4.5E-15 


3.1E-14 


1.7 


4.4E-02 


1.6E-02 


3.4E-03 




16 


5.6E-15 


3.7E-14 


2 


1.9E-02 


2.0E-02 


3.3E-03 




32 


6.7E-15 


4.4E-14 


2 


4.6E-02 


2.4E-02 


4.7E-03 




64 


6.5E-15 


4.2E-14 


2 


5.5E-02 


2.2E-02 


4.6E-03 


4096 




8 


2.6E-15 


1.7E-14 


1.3 


1.3E-02 


1.8E-02 


4.0E-03 




16 


3.0E-15 


1.9E-14 


1.7 


2.6E-02 


2.0E-02 


3.9E-03 




32 


3.8E-15 


2.4E-14 


2 


1.9E-02 


2.5E-02 


5.1E-03 




64 


3.4E-15 


2.0E-14 


2 


6.0E-02 


2.1E-02 


4.1E-03 


2048 




8 


1.5E-15 


8.7E-15 


1.6 


2.7E-02 


1.8E-02 


4.2E-03 




16 


1.6E-15 


1.0E-14 


2 


2.1E-1 


2.1E-02 


4.5E-03 




32 


1.8E-15 


1.1E-14 


1.8 


2.3E-1 


2.3E-02 


5.1E-03 




64 


1.7E-15 


1.0E-14 


1.2 


4.1E-02 


2.1E-02 


4.5E-03 


1024 




8 


7.8E-16 


4.9E-15 


1.6 


5.5E-02 


2.0E-02 


4.9E-03 




16 


9.2E-16 


5.2E-15 


1.2 


1.1E-1 


2.1E-02 


4.8E-03 




32 


9.6E-16 


5.8E-15 


1.1 


1.5E-1 


2.3E-02 


5.6E-03 




64 


8.7E-16 


4.9E-15 


1.3 


7.9E-02 


2.0E-02 


4.5E-03 


GEPP 


8192 




3.9E-15 


2.6E-14 


1.6 


1.3E-02 


1.4E-02 


2.8E-03 


4096 




2.1E-15 


1.4E-14 


1.6 


1.8E-02 


1.4E-02 


2.9E-03 


2048 




1.1E-15 


7.4E-15 


2 


2.9E-02 


1.5E-02 


3.4E-03 


1024 




6.6E-16 


4.0E-15 


2 


5.8E-02 


1.6E-02 


3.7E-03 
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Appendix C 

Here we summarize the floating-point operation counts for the LILPRRP algorithm 
performed on an input matrix A of size m x n, where m > n . We first focus on the 
step k of the algorithm, that is we consider the k th panel of size (m — (k — 1)6) x 6. We 
first perform a Strong RRQR on the transpose of the considered panel, then update 
the trailing matrix of size (m — kb) x (n — kb), finally we perform GEPP on the diagonal 
block of size b x b. We assume that m — (k — 1)6 > 6 + 1. The Strong RRQR performs 
nearly as many floating-point operations as the QR with column pivoting. Here we 
consider that is the same, since in practice performing QR with column pivoting is 
enough to obtain the bound r, and thus it is 

Flops SRRQR,lblock,stepk = 2(m — (k — 1)6)6 2 — -6 3 . 

If we consider the update step |3|, then the flops count is 

Flops update , st£p k = 26(m — kb)(n — kb). 
For the additional GEPP on the diagonal block, the flops count is 

Flops gepP:3tepk = |& 3 + (n - fc6)6 2 . 

Then the flops count for the step k is 

Flops LU_PRRP,ste P k = b 2 (2m + n — 3(fc — 1)6) — 6 3 + 26(m — kb)(n — kb). 

This gives us an arithmetic operation count of 

Flops LU _p RRP (m,n,b) = e| =1 [6 2 (2m + n - 2{k - 1)6 

FlopsLu_PRRp(m, n, 6) = rati 2 + 2mnb + 2nb 2 — ^n 2 b 

~ rati 2 — -n 3 + 2mnb — -n 2 b. 
3 2 

Then for a square matrix of size n x n, the flops count is 

Flops L u_p RR p(n,n,b) = ^n 3 + ^n 2 6 + 2n& 2 ~ ^n 3 + ^n 2 6. 



kb) + 2b(m~kb)(n-kb)] 



Appendix D 

Here we detail the performance model of the parallel version of the CALU_PRRP 
factorization performed on an input matrix A of size m x n where, m > n. We 
consider a 2D layout P — P r x P c . We first focus on the panel factorization for the 
block LU factorization, that is the selection of the b pivot rows with the tournament 
pivoting strategy. This step is similar to CALU except that the reduction operator is 
Strong RRQR instead of GEPP, then for each panel the amount of communication is 
the same as for TSLU: 

# messages = log P r 
# words = 6 2 log P r 

However the floating-point operations count is different. We consider as in Appendix 
C that Strong RRQR performs as many flops as QR with columns pivoting, then 
the panel factorization performs the QR factorization with columns pivoting on the 
transpose of the blocks of the panel and logP r reduction steps: 
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# flops = 2 m = | 1)b b 2 - |b 3 + lo g p.(2(26)6 2 - |fo 3 ) 

To perform the QR factorization without pivoting on the transpose of the panel, and 
the update of the trailing matrix: 

• broadcast the pivot information along the rows of the process grid. 

# messages = log P c 
# words = 6 log P c 

• apply the pivot information to the original rows. 

# messages = logP r 

# words = ^ log P r 

tc 

• Compute the block column L and broadcast it through blocks of columns 

# messages = logP c 

# words = m ^ j Q g p c 

# flops = 2 — 6 

• broadcast the upper block of the permuted matrix A through blocks of rows 

# messages = logP r 

# words = n j, j Q g 

J c 

• perform a rank-b update of the trailing matrix 

„a m-kbn-kb 

# flops = 26 — — 

1 r r c 

Thus to get the block LU factorization : 

# messages = ^ log P r + ^ log P c 
6 6 

li 2 n, ,-, / , 3 n 2 
#words = ( — - + n )logP c + (n6+-— )logP, 

# flops = ^(mn 2 - ^n 3 ) + ?n6 2 (51ogP r - 1) + -^-(4mn + 2nb- 2n) 

~ i(mn ! - ^n 3 ) + |n6 2 (51ogP r - 1) + ^-(ran - y)6 

Then to obtain the full LU factorization, for each 6x6 block, we perform the Gaussian 
elimination with partial pivoting and we update the corresponding trailing matrix of 
size 6 x n — kb. During this additional step, we first perform GEPP on the diagonal 
block, broadcast pivot rows through column blocks (this broadcast can be done to- 
gether with the previous broadcast of L, thus there is no additional message to send), 
apply pivots, and finally compute U. 

# words = n(l + |)logP c 
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# flops = g 



2 , -j n — kb,n 

-b H = — 6 

3 P c 



2,2,1 n 2 & 
3 n& + 2^ 



Finally the total count is : 
3n 



# messages 
# words 

# flops 



2n 

logP r + — logP, 



mn 
mn 



In nb „ . , „ , , 3 n . , _, 
2jT + Y+ 2n ) logP c + (nb+ -— )logP r 

I^)logP c + (n6+|^)logP r 



I 2 



1 3 , 4 , 

Q n )+ ^-( mn - 



n 2 n 2 & 10 2 

T )b+ 2R + y nfe logP " 



Appendix E 



Here we detail the algebra of the block parallel LU-PRRP. At the first iteration, the 
matrix A has the following partition 



A = 



An 
A 2 i 



A12 
A 22 



The block An is of size m/p x 6, where p is the number of processors used and 
m/p > b+ 1, the block A12 is of size m/p x n — b, the block A21 is of size m — m/p x 6, 
and the block A22 is of size m — m/p x n — b. 

To describe the algebra, we consider 4 processors and a block of size b. We first 
focus on the b first columns, 



A(:,l:6) 



A 
Ai 
A 2 
A 3 



Each block Ai is of size m/p x b. In the following we describe the different steps of the 
panel factorization. First, we perform Strong RRQR factorization on the transpose of 
each block A; so we obtain : 



A Hoo = Q00P00 
Aj Ilio = Q10P10 
A 2 r n 2 o = Q20P20 
A3II30 = Q30P30 

Each matrix Ri is of size b x m/p. Using M ATLAB notations, we can write Ri as 
following : 



P, = Pi(l : b, 1 : 6) Pi = Pi(l : b, b + 1 : m/p) 



This step aims to eliminate the last m/p — b rows of each block Ai. We define the 
matrix PA)IIo : 
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where 



Doo — Roo(R () q) T , 
Dio = Rio{Ru)) T ' i 

D20 = R2o(R.2o) T > 
D30 = #30 (#30 ) • 



Multiplying A(:,l : 6) by -Dollo we obtain : 



A>n xA(:,l:b) = 





- (IlJ, x Ao)(l ■ M 


b) - 




Aoi 




Om/p-fe 






Om / p — b 




(nf X Ai)(l : 6,1 


b) 




An 




Om/p-6 






Om /p — b 


6) = 


(nii, X A 2 )(l : 6,1 


b) 




A 2 i 




Om/p-fe 






Om /p — b 




(nio x A 3 )(l : 6,1 


b) 




A 3 i 




Om/p-6 






Om/p-b 



The second step corresponds to the second level of the reduction tree. We merge 
pairs of 6 x 6 blocks and as in the previous step we perform Strong RRQR factorization 
on the transpose of the 26 x 6 blocks : 



and 



We obtain 



A 01 
A 11 



A 2 i 
A31 



We note : 



Aoi 
An 
A21 
A 31 



II01 = Q01R01, 



IIii = QuRu- 



R01 = Roi(l : 6, 1 : 6) R 01 = R i(l : 6,6 + 1 : 26), 
R11 = fln(l : 6, 1 : 6) Ru = Ru(l : 6, 6 + 1 : 26). 

As in the previous level, we aim to eliminate b rows in each block, so we consider the 
matrix 
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-An 



*m./p — b 



D1II1 = 



h 



i/p — b 



where 



-Du 



lm/p — b 



Lm/p — b 



Doi — Roi(R () i) T , 
Du — Rj 1 (R ±) T . 

The matrices floi and flu are the permutations corresponding to the Strong RRQR 
factorizations of the two 26 x 6 blocks. The matrices IL31 and Iln can easily be deduced 
from the matrices floi and flu extended by the appropriate identity matrices to get 
matrices of size 2m jp x 2m/ 'p. 



The multiplication of the block DqYIqA(\, 1 : 6) with the matrix DiIIi leeds to 



£>iniA>n A(:,l : b) = 



This matrix can be written as 



An 
An 

02m/p-6 



(1 : 6, 1 : b) 



An x 



A 2 i 
A 3 i 

02m/p — 6 



(1 : 6, 1 : b) 



DiUiDoUoAi:, 1 : b) = 



ngi x 



An X 



X A )(l :b,l:b) 
(nf X Ai)(l:6,l:6) 

02m/p-6 

(nioX A 2 (l:6,l:&)) 
(IlS)X^3(l:6,l:6)) 

02m /p — b 



(1:6,1:6) 



(1 : 6, 1 : 6) 







A>2 

2m/ p- 



A 12 

^2m/p- 



For the final step we consider the last 2b x 6 block 



' A)2 " 

_ A12 . ■ 

We perform a Strong RRQR factorization on the transpose of this block and then 
we obtain 
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A>2 

A 12 



II02 = Qo2^?02- 



We note 



R02 = Ro2(l : 6, 1 : 6) R02 = -R 02 (l : 6, 6 + 1 : 26). 



We define the matrix D 2 II 2 as 



D2II2 



t-m/p—b 



-A. 



*-m/p — b 



i/p — b 



h 



Lm/p — b 



x ILi 



where 



A2 — Ro2(R()2 ) i 



and the permutation matrix II02 can easily be deduced from the permutation matrix 

n ()2 . 

The multiplication of the block DiHiDqI1oA(:, 1 : b) by the matrix D2II2 leeds to : 



DiUiD^DoUoAi:, 1 : 6) 



■R021 Q 02 

04m/p — 6 



A 2 
^4m/p—b 



A, 



03 







'4m/p — 6 



If we consider the block A(:,l : b) of the beginning and all the steps performed, 
we get : 



D 2 n 2 An 1 A)noA(:,l : b) = 



A 02 

A X 2 





(1 : b, 1 : b) 



We can also write 



L> 2 n 2 AniA)noA(:,i = b) 





Hoi x 


S02 x 


ftn x 



4m/p— 6 

(IlS) x A )(l :6,1 :b) 
(nf X Ai)(l : 6,1 : fa) 

(Il£,xA a )(l:M:6) 
(nio x A 3 )(l : 6,1 : 6) 

^4m/p — b 



(1 : 6, 1 : 6) 



(1 : 6, 1 : 6) 



(1 : 6, 1 : 6) 
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We can also write 



A(:, 1:6) = (DanaDiniDoIIor 1 









Hoi x 


no2 x 






An x 



Then, we have 



A(:, 1:6) = I#£>o 1 II?'Dr 1 Il! , D^ 



where 



n 



n„ 









Hoi x 


no2 x 






fin x 


nio 





(Il£, x : 6,1 : b) 

(n5,xAi)(l:6,l:6) 
(n^ X A 2 )(l : 6,1 : 6) 
(nioX A 3 )(l:6,l:6) 

^Am/p — b 

(IIS, x A )(l : 6,1 : 6) 
(nT x Ai)(l : 6,1 : 6) 
(nio x A 2 )(l : 6,1 : 6) 
(nioX A 3 )(l:6,l:b) 



(1 : 6, 1 : 6) 



(1 : 6, 1 : 6) 



(1 : 6, 1 : 6) 



(1 : 6, 1 : 6) 



(1 : 6, 1 : 6) 



(1 : 6, 1 : 6) 







4m/p — b 



n 20 



n 30 



h 

Doo Im/p-b 



h 

Dio I n 



/ P -b 



It 

D20 Im/p-b 



lb 

D30 Im/p-b 

For each Strong RRQR performed previously, the corresponding trailing matrix is up- 
dated at the same time. Thus, at the end of the process, we have 



A = n„ A7 ln T D^Ill D^ 1 x 



^03 A12 
^im/p-b A22 

where A 03 is the 6x6 diagonal block containing the b selected pivot rows from the 
current panel. An additional GEPP should be performed on this block to get the full 
LU factorization. A22 is the trailing matrix already updated, and on which the block 
parallel LU-PRRP should be continued. 



Appendix F 

Here is the matlab code for generating the matrix T used in Section 2 to define the 
generalized Wilkinson matrix on which GEPP fails. For any given integer r > 0, this 
generalized Wilkinson matrix is an upper triangular semi-separable matrix with rank 
at most r in all of its submatrices above the main diagonal. The entries are all negative 
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and are chosen randomly. The Wilkinson matrix is for the special case where r = 1 
and every entry above the main diagonal is 1. 

function [A] = counterexample_GEPP(n,r ,u, v) ; 

"/, Function count erexample_GEPP generates a matrix which fails 

"/, GEPP in terms of large element growth. 

'/, This is a generalization of the Wilkinson matrix. 

7. 

if (nargin == 2) 

u = rand(n.r) ; 

v = rand(n.r) ; 

A = - triu(u * v' ) ; 

for k = 2:n 

umax = max(abs (A(k-1 ,k:n) ) ) * (1 + 1/n) ; 

A(k-l,k:n) = A(k-l,k:n) / umax; 

end 

A = A - diag(diag(A)) ; 
A = A' + eye(n) ; 
A(l:n-l,n) = ones(n-l , 1) ; 

else 

A = triu(u * v' ) ; 
A = A - diag(diag(A)) ; 
A = A' + eye(n) ; 
A(l:n-l,n) = ones(n-l,l); 

end 
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